MongoDB Indexing Tip #3: How to Deal with Too Many Fields

[Editor's note: This is part 3 of a series of tips for MongoDB. Check out tip #1 and tip #2 as well.]

The Problem

There are situations where your documents have many different fields and you want to be able to query efficiently on any of them. Say you have a document describing a person:

{
    _id: 123,
    firstName: "John",
    lastName: "Smith",
    age: 25,
    height: 6.0,
    dob: Date,
    eyes: "blue",
    sign: "Capricorn",
    ...
}

You may then want to find all people with blue eyes, all people of a certain height, everyone with a given last name, and so on. Now say you have dozens of such properties, or you do not even know in advance what they will be, or they vary from document to document… How can you use indexing to resolve these queries quickly? Obviously it would be expensive and impractical to create a separate index on each of those fields.
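For instance, the naive approach of one index per queryable field quickly gets out of hand (a minimal sketch, assuming a hypothetical "people" collection):

> db.people.ensureIndex({eyes: 1})
> db.people.ensureIndex({height: 1})
> db.people.ensureIndex({lastName: 1})
// ...and so on for every property you might query, each index adding storage and write overhead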

Solution #1: Compound Index on Name and Value

Let’s start with the schema design and leverage the power of JSON by using a list to store all the properties:

{
    _id: 123,
    props: [
    { n: "firstName", v: "John"},
    { n: "lastName", v: "Smith"},
    { n: "age", v: 25},
    ...
    ]
}

The index to create here is a compound index on the name and value fields within the list. To illustrate this, let's create millions of documents with dummy properties ("prop0" through "prop9") that each take a random value between 0 and 1000.

> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { arr.push({n: "prop" + j, v: Math.floor(Math.random() * 1000) }) }; db.generic.insert({props: arr}) }
> db.generic.findOne()
{
  "_id": ObjectId("515dd3b4f0bd676b816aa9b0"),
  "props": [
    {
      "n": "prop0",
      "v": 40
    },
    {
      "n": "prop1",
      "v": 198
    },
...
    {
      "n": "prop9",
      "v": 652
    }
  ]
}
> db.generic.ensureIndex({"props.n": 1, "props.v": 1})
> db.generic.stats()
{
  "ns": "test.generic",
  "count": 5020473,
  "size": 1847534064,
  "avgObjSize": 368,
  "storageSize": 2600636416,
  "numExtents": 19,
  "nindexes": 2,
  "lastExtentSize": 680280064,
  "paddingFactor": 1,
  "systemFlags": 1,
  "userFlags": 0,
  "totalIndexSize": 1785352240,
  "indexSizes": {
    "_id_": 162898624,
    "props.n_1_props.v_1": 1622453616
  },
  "ok": 1
}

As you can see, the index is quite large at 1.6GB, since both the property name and its value are stored in every index entry. That is the price of a generic index! Now on to querying… Let's find the documents where "prop1" is 0:

> db.generic.findOne({"props.n": "prop1", "props.v": 0})
{
  "_id": ObjectId("515dd4298bff7c34610f6ae8"),
  "props": [
    {
      "n": "prop0",
      "v": 788
    },
    {
      "n": "prop1",
      "v": 0
    },
...
    {
      "n": "prop9",
      "v": 788
    }
  ]
}
> db.generic.find({"props.n": "prop1", "props.v": 0}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 49822,
  "nscannedObjects": 5020473,
  "nscanned": 5020473,
  "nscannedObjectsAllPlans": 5020473,
  "nscannedAllPlans": 5020473,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 252028,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        {
          "$minElement": 1
        },
        {
          "$maxElement": 1
        }
      ]
    ]
  },
  "server": "agmac.local:27017"
}

This does not give the expected result: it matches about 50k documents and takes 252 seconds! The reason is that, per the query language, n: "prop1" and v: 0 do not need to be in the same subdocument as long as they are somewhere in the same document. In effect, the query considers all combinations of "n" and "v" within a document and matches more than it should. You can debate that choice of query semantics, but the way to force both conditions to match the same subdocument is "$elemMatch":

> db.generic.findOne({"props": { $elemMatch: {n: "prop1", v: 0} }})

Now, how is the index used, and how long does it take MongoDB v2.2 to find the documents? Let's see:

> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 5024,
  "nscannedObjects": 5020473,
  "nscanned": 5020473,
  "nscannedObjectsAllPlans": 5020473,
  "nscannedAllPlans": 5020473,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 278784,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        {
          "$minElement": 1
        },
        {
          "$maxElement": 1
        }
      ]
    ]
  },
  "server": "agmac.local:27017"
}

It now returns the correct 5024 documents… but it is just as slow! As the explain output shows, the reason is that the bound on the "v" field is still open-ended. Why is that? Let's back up for a second: without $elemMatch, any combination of fields can match, and it would be impossible to build an index covering all those combinations. So MongoDB chose to put the values of a subdocument together in the same btree bucket and to ignore cross-subdocument combinations (essentially the behavior of $elemMatch). Why, then, is the $elemMatch query still slow? It is due to a shortcoming that was fixed in v2.4; see SERVER-3104. After upgrading to 2.4, you will see this:

> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 5024,
  "nscannedObjects": 5024,
  "nscanned": 5024,
  "nscannedObjectsAllPlans": 5024,
  "nscannedAllPlans": 5024,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 21,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        0,
        0
      ]
    ]
  },
  "server": "agmac.local:27017"
}

Alright, we are down to 21 milliseconds; that's more like it! To do "and"/"or" queries, use the $all and $in operators respectively. Note that in the case of $all, only the first item is used for the btree lookup, so put the most restrictive condition first if you know it. An "and" over two properties looks like this:

db.generic.find({"props": { $all: [{ $elemMatch: {n: "prop1", v: 0} },{ $elemMatch: {n: "prop2", v: 63} } ]}})

One caveat with this solution: range queries on the property value do not restrict the index bounds properly and end up scanning far too much. This bug, SERVER-10436, will hopefully be fixed in v2.6:

> db.generic.find({ props: { $elemMatch: {n: "prop1", v: { $gte: 6, $lte: 9 } }}}).explain()
{
  "cursor": "BtreeCursor props.n_1_props.v_1",
  "isMultiKey": true,
  "n": 506,
  "nscannedObjects": 126571,
  "nscanned": 126571,
  "nscannedObjectsAllPlans": 126571,
  "nscannedAllPlans": 126571,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 1,
  "nChunkSkips": 0,
  "millis": 1396,
  "indexBounds": {
    "props.n": [
      [
        "prop1",
        "prop1"
      ]
    ],
    "props.v": [
      [
        6,
        1.7976931348623157e+308
      ]
    ]
  },
  "server": "agmac.local:27017"
}

Solution #2: Single BLOB Index

Another approach to the problem is to put each "property: value" pair as-is inside a subobject of a list. This solution works with both v2.2 and v2.4. Let's create documents like this:

> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { var doc = {}; doc["prop" + j] =  Math.floor(Math.random() * 1000); arr.push(doc) }; db.generic2.insert({props: arr}) }
> db.generic2.findOne()
{
  "_id": ObjectId("515e5e6a71b0722678929760"),
  "props": [
    {
      "prop0": 881
    },
    {
      "prop1": 47
    },
...
    {
      "prop9": 717
    }
  ]
}

The index should be on the list itself since the property name varies:

> db.generic2.ensureIndex({props: 1})
> db.generic2.stats()
{
  "ns": "test.generic2",
  "count": 5000000,
  "size": 1360000032,
  "avgObjSize": 272.0000064,
  "storageSize": 1499676672,
  "numExtents": 19,
  "nindexes": 2,
  "lastExtentSize": 393670656,
  "paddingFactor": 1,
  "systemFlags": 1,
  "userFlags": 0,
  "totalIndexSize": 2384023488,
  "indexSizes": {
    "_id_": 162269072,
    "props_1": 2221754416
  },
  "ok": 1
}

As you can see, the index is almost 30% larger than in solution #1, since the BSON subdocuments themselves are stored in the index as BLOBs. Now on to querying:

> db.generic2.find({"props": {"prop1": 0} }).explain()
{
  "cursor": "BtreeCursor props_1",
  "isMultiKey": true,
  "n": 4958,
  "nscannedObjects": 4958,
  "nscanned": 4958,
  "nscannedObjectsAllPlans": 4958,
  "nscannedAllPlans": 4958,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 15,
  "indexBounds": {
    "props": [
      [
        {
          "prop1": 0
        },
        {
          "prop1": 0
        }
      ]
    ]
  },
  "server": "agmac.local:27017"
}

The result, at 15ms, is even faster than solution #1! But one caveat is that the predicate must always be a full JSON object. To match "prop1" between 0 and 9, the query would be:

> db.generic2.find({"props": { $elemMatch: { $gte: {"prop1": 0}, $lte: {"prop1": 9} }})

If there are other fields in the subobject, they must be part of the JSON predicate when matching (remember that the subobject is just a BLOB to MongoDB). Now, say you want an open range where "prop1" exists and is greater than 6: you should still specify an upper bound, otherwise the query will match many more documents than expected. Ideally you could use MaxKey as the upper bound, but I discovered a bug, SERVER-10394, whereby a proper bound of the same type must be specified, for example a high integer value:

db.generic2.find({"props": { $elemMatch: {$gte: {"prop1": 6}, $lt: {"prop1": 99999999 } }}})

One caveat: you cannot index just the values independently. For example, with solution #1, if you want to find any document that has the value 10 for any property, you can simply create an index on "props.v". This is not possible with solution #2, since the field name varies.
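A minimal sketch of that value-only index with solution #1:

> db.generic.ensureIndex({"props.v": 1})
> db.generic.find({"props.v": 10})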

Conclusion

In conclusion, MongoDB v2.4 now offers a straightforward and efficient way to build a generic index over many properties. You are now free to index and query all of the many properties of your Big Data project :)
