MongoDB Indexing Tip #1: Find Your Friends' Recent Activity
Problem
A very common feature of any social site (and of many other types of application) is to show the latest activity of one's friends. A related example is fetching the latest posts for the people or topics that one follows. You may think that databases make such a feature easy to implement, but unfortunately it is far from easy.
Many newer NoSQL stores do not support the secondary indexes that this feature requires. MongoDB does support them, and the query can be made efficient, but there are many pitfalls along the way.
Fetching the Friends' Ids
The first step is to fetch the list of the user's friend ids. The friendship relationship is typically either embedded in the user document (as a list of user ids) or stored in a mapping table (SQL style). In both cases it is assumed that the application can quickly put together the list of friend ids.
The Ideal Query
The intuitive query should look like this:
db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(100)
Notes:
- it is much better to use "$in" rather than "$or": the latter does not use indexes properly when combined with a sort.
- the sort is descending on date, to obtain the most recent documents first.
- the limit says that only the 100 most recent documents are needed.
For such a query, the best index is the compound "{ userId: 1, date: 1 }". Note that it does not matter whether the index is ascending or descending on date.
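As a sketch, this index can be created from the mongo shell (using the 2.x-era `ensureIndex` helper; on newer servers the equivalent is `createIndex`):

```javascript
// Compound index: the equality/$in field first, the sort field second.
db.posts.ensureIndex({ userId: 1, date: 1 })
```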
What is expected? Ideally, MongoDB will take the following steps:
- go to each friend's branch of the index.
- pick the 100 most recent documents within each branch, which is fast thanks to the compound index on "date".
- apply an in-memory sort on those documents (e.g. merging 1,000 documents for 10 friends, which is fast) and then return the most recent 100.
The following diagram shows what should ideally happen.
The "Limit" Problem
One debatable design decision was to not pass the "limit" value in the query protocol. Instead, only the "batchSize" is communicated, which tells the server how many results to return in the next batch but does not say that no more results are needed.
If you set a limit of 100, the "batchSize" is set to 100, but the server assumes you may ask for more than 100 records and keeps the cursor open. This in turn makes it impossible for the server to reduce the amount of sorting: it must sort in memory *all* of your friends' documents *since the beginning of time*.
The fix is to use a negative limit: if set to -100, it is passed to the server as a negative "batchSize" value, which tells the server: send a single batch of at most 100 documents and do not keep the cursor open. One caveat to this solution is that a batch cannot exceed 16MB, so the 100 documents must fit within 16MB (which is likely).
The Problem with MongoDB 2.0
With MongoDB 2.0 (which hopefully you are not using anymore at this time) the behavior is still not the expected one: the server disregards the limit and sorts all documents since day one. As a consequence, queries get slower and slower as data accumulates, which is a big no-no.
The following diagram shows the bad behavior.
The Limited Fix with MongoDB 2.2 and 2.4
A limited fix was implemented in 2.2. The following query will work as expected and return within milliseconds:
db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(-100)
So you may think that all is good… unfortunately a big limitation remains: any extra predicate may turn off the optimization. An extra predicate must be an equality (or a "$in") on a field that appears in the index to the left of the field sorted on ("date" in this case). If you want to see only posts that have specific properties (e.g. "private: false") you must be very careful how you build the index:
db.posts.find({ userId: { $in: [12, 24, 56, …] }, private: false }).sort({ date: -1 }).limit(-100)
It must use an index on "{ userId: 1, private: 1, date: 1 }". A different type of predicate, or one against a field not covered by the correct portion of the index, will put you back in the bad slow case.
This issue will be fixed as part of the 2.6 index internal API refactoring, which you should look forward to!
A Solution: Add a Predicate on Sorted Field
One solution is to also specify a date range, which will properly limit how much data gets sorted.
For example:
db.posts.find({ userId: { $in: [12, 24, 56, …] }, date: { $gt: x, $lt: y } }).sort({ date: -1 }).limit(-100)
The obvious problem is that you do not know in advance which date range will yield too few or too many results. You can first query for the past day, then go back in time with increasingly large ranges if not enough data is returned.
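One way to implement this back-off is a loop that doubles the window until enough documents come back. A minimal sketch, assuming a hypothetical `fetchRange(from, to)` helper that wraps the date-range query above and returns posts sorted by date descending:

```javascript
// Widen the date range going back in time until we have `limit` posts,
// giving up after `maxTries` attempts. `fetchRange` is assumed to run
// the $in + date-range query and return results sorted by date desc.
function fetchRecent(fetchRange, now, limit, maxTries) {
  var windowMs = 24 * 60 * 60 * 1000; // start with the past day
  for (var i = 0; i < maxTries; i++) {
    var posts = fetchRange(now - windowMs, now);
    if (posts.length >= limit) return posts.slice(0, limit);
    windowMs *= 2; // not enough results: double the range and retry
  }
  return fetchRange(0, now).slice(0, limit); // fall back to everything
}
```

Each retry costs an extra round trip, so in practice the starting window should be tuned to how often friends post.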
A Solution: Multiple Queries
To properly make full use of the index, a solution is to split the query into many queries, one per friend id. In that case MongoDB can properly iterate through the sorted index without doing extra work.
The queries would be:
db.posts.find({ userId: 12, … }).sort({ date: -1 }).limit(-100)
db.posts.find({ userId: 24, … }).sort({ date: -1 }).limit(-100)
…
The client code then merges and sorts the separate results, which is fast. Any predicate can be added to filter the data further; its field should appear to the right of "date" in the index, so the server can filter while scanning in date order.
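The client-side merge can be sketched as follows, assuming each per-friend query returned an array already sorted by date descending:

```javascript
// Merge k per-friend result arrays (each sorted by date desc) and keep
// only the `limit` most recent posts overall.
function mergeRecent(resultsPerFriend, limit) {
  var all = [];
  resultsPerFriend.forEach(function (posts) {
    all = all.concat(posts);
  });
  // A plain sort of k * limit documents is cheap (e.g. 100 * 100 = 10k).
  all.sort(function (a, b) { return b.date - a.date; });
  return all.slice(0, limit);
}
```

A k-way heap merge would avoid re-sorting, but at these sizes a plain sort is simpler and plenty fast.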
Unfortunately this solution only works well if the number of friends is fairly low. Each query should return within a millisecond, but with 100 friends you are already looking at several milliseconds. The good news is that the speed remains stable over time, though there is some extra overhead for sure. Overall it is a good solution if the number of friends is under 100 on average.
A Solution: Fan Out
An alternative strategy is to duplicate the data into each friend's own feed. When a user posts an update, it is copied into one document per follower. The problematic query then becomes easy:
db.posts.find({ userId: 7 }).sort({ date: -1 }).limit(-100)
The obvious downside is the large duplication of content: if a user has 10k followers, a single post generates 10k documents written to the database.
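The fan-out write can be sketched as follows; the `followerIds` argument and the resulting document shape are assumptions for illustration:

```javascript
// Build one copy of the post per follower. In practice each document
// would be inserted into `posts` (e.g. with a bulk insert) so that the
// single-userId query above finds it in the follower's own feed.
function fanOut(post, followerIds) {
  return followerIds.map(function (followerId) {
    return {
      userId: followerId,      // the feed owner, not the author
      authorId: post.authorId, // who actually wrote the post
      date: post.date,
      text: post.text
    };
  });
}
```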
Conclusion
In conclusion, either:
- use the 2.2 optimization, but be very careful about its implementation
- use one of the alternative solutions above; each has its pros and cons.
Expect MongoDB 2.6 to give you full intuitive support for this! One ticket to watch is SERVER-3310.
Published at DZone with permission of Antoine Girbal, DZone MVB. See the original article here.