MongoDB - Limiting Collection Size

Recently I had a small task: storing some client events in MongoDB. This was supposed to be a separate collection, with just one simple requirement: we only needed to keep the last 100 events per client. Now, when you think about limiting collection size in MongoDB, the first thing that comes to mind is capped collections.
  • Capped collections are fixed-size collections. That is, a capped collection always occupies the space you set up for it: for example, 1000 records. When the allocated space is full, it starts to write new records over the oldest ones. There are some limitations, though. For example, a capped collection can't be sharded (but usually you don't need that, as it's used to keep the kind of info you don't need too much of). Also, you can't delete documents from it, and you can't make updates that increase a document's size. The last restriction comes from the guarantee that the collection always preserves its natural order, that is, documents are stored on disk in the order you created them. More on capped collections here: MongoDB documentation.
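For reference, a capped collection has to be created explicitly with createCollection. A minimal shell sketch; the collection name and the size limits here are just illustrative:

    // Create a capped collection limited to ~1 MB of storage and at most 1000 documents
    db.createCollection("recentEvents", { capped: true, size: 1048576, max: 1000 })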
Capped collections are a useful feature: they're very fast and save some effort. Unfortunately, we couldn't use them, because we needed to limit the number of records per client, not the total collection size. There's another useful feature that can help when you don't need to store old records: TTL.
  • TTL, or time to live, is a collection property that is set by a special kind of index. The index should be created on a date BSON field and have the expireAfterSeconds property set. This property is the time (in seconds) each record will be kept. The following command turns an ordinary collection into a TTL one:

    db.clientEvents.createIndex( { "createDate": 1 }, { expireAfterSeconds: 3600 } )

    This collection will always keep just one hour's worth of data; the rest is deleted by a special background job. This is where one of the restrictions comes from: a TTL collection can't be capped (because, as we remember, you can't delete from capped collections, right?). You can read about the other limitations here:
    MongoDB expire data tutorial
TTL collections are good but still weren't good enough for us. We could use them if we wanted to store, for example, only the last month's data. But we wanted a fixed number of records per client, so... we kept thinking. The next thing that comes to mind is to reinvent the wheel. As in, create a home-made solution. The simple variant (sketched in code after this list) is:
  • When a record is saved, check how many records the client already has.
  • If it's more than 100, remove the oldest one.
  • Then, insert our new one.
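A minimal mongo-shell sketch of this variant. The saveEvent helper is hypothetical, and it assumes each event document carries the clientId and createDate fields:

    function saveEvent(event) {
        // Count how many events this client already has
        var count = db.clientEvents.count({ clientId: event.clientId });
        if (count >= 100) {
            // Find the oldest event for this client...
            var oldest = db.clientEvents.find({ clientId: event.clientId })
                                        .sort({ createDate: 1 })
                                        .limit(1)
                                        .next();
            // ...and remove it
            db.clientEvents.remove({ _id: oldest._id });
        }
        // Then insert the new one
        db.clientEvents.insert(event);
    }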
This is all fine, but in the worst case we would have three extra operations for each save:
  • Get record count for the client;
  • Find the oldest record for the client;
  • Remove it.
So, I needed to do better. And then the question was: can we have more than 100 records per client for some time? The answer was: sure we can, as long as we only show the last 100 records to the client. Still, we don't need too much old data, so it should be cleaned up sometimes. So, how often is "sometimes"? What if we do the cleanup once for every 50 insertions? Then we should always have between 100 and 150 records per client, which doesn't take up too much space, and we show only the last 100. Sure, and this is how it goes (see the sketch after the list):
  • On save, we generate a random number between 1 and 50;
  • If the number hits one chosen value, say 12 (or 25, or whatever), we get the last 100 records for the client;
  • And we remove all of that client's events whose IDs are NOT in this result set.
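A sketch of this probabilistic cleanup in the mongo shell, reusing the hypothetical saveEvent shape and the assumed clientId and createDate fields from above (in reality this logic would live in the application code):

    function saveEventWithCleanup(event) {
        // Always insert first; temporary growth past 100 records is acceptable
        db.clientEvents.insert(event);
        // Roll 1..50; one fixed outcome triggers cleanup,
        // i.e. roughly once per 50 insertions
        if (Math.floor(Math.random() * 50) + 1 === 12) {
            // Collect the IDs of the 100 most recent events for this client
            var keepIds = db.clientEvents
                .find({ clientId: event.clientId }, { _id: 1 })
                .sort({ createDate: -1 })
                .limit(100)
                .toArray()
                .map(function (doc) { return doc._id; });
            // Remove everything for this client that is not among the kept IDs
            db.clientEvents.remove({ clientId: event.clientId, _id: { $nin: keepIds } });
        }
    }

An index on { clientId: 1, createDate: -1 } would keep both the fetch and the cleanup cheap.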
This is still more operations than we would like to have, but it only happens once in 50 times, so we decided to go with it and see how it works. Hopefully it will!

Published at DZone with permission of Maryna Cherniavska.
