DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations

Trending

  • WireMock: The Ridiculously Easy Way (For Spring Microservices)
  • How to Optimize CPU Performance Through Isolation and System Tuning
  • HashMap Performance Improvements in Java 8
  • Which Is Better for IoT: Azure RTOS or FreeRTOS?
  1. DZone
  2. Data Engineering
  3. Databases
  4. MongoDB Indexing Tip #1: Find Your Friends' Recent Activity

MongoDB Indexing Tip #1: Find Your Friends' Recent Activity

Antoine Girbal user avatar by
Antoine Girbal
·
Oct. 31, 13 · Interview
Like (0)
Save
Tweet
Share
10.40K Views

Join the DZone community and get the full member experience.

Join For Free

Problem

A very common feature of any social site (and other types of application) is to find the latest activity of one’s friends. Another related example is to fetch the latest data for people or topics that one follows. You may think that databases would make it easy to implement such a feature, but unfortunately it is far from being easy.

Many new NoSQL stores do not really support secondary indexes, which are necessary for this feature. In the case of MongoDB it is possible to do it efficiently but it presents many pitfalls.

Fetching the Friends’ ids

The first step is to fetch the list of the user’s friends ids. The friendship relationship is typically either embedded into the user document (as a list of user ids) or in a mapping table (SQL style). In both cases it is assumed the application can quickly put together the list of friends’ ids.

The Ideal Query

The intuitive query should look like this:

db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(100)

Notes:

  • it is much better to use “$in“ rather than “$or“. The latter does not use indexes properly when combined with a sort.
  • the sort is for descending time, to obtain the most recent
  • the limit tells that we are only interested in the last 100 documents.

For such a query, the best index is a compound “{ userId: 1, date: 1}“. Note that it is not important whether it is ascending or descending on the date.

What is expected? Ideally, MongoDB will take the following steps:

  • go to each friend’s branch of the index.
  • pick the 100 most recent documents within each, which is fast thanks to the compound on “date“.
  • apply an in-memory sort on those documents (e.g. merge 1000 documents for 10 friends, which is fast) and then return the most recent 100.

The following diagram shows what should ideally happen.

image

The “Limit“ Problem

One debatable design decision was to not pass the “limit“ value into the query protocol. Instead only the “batchSize“ is communicated, which tells the server how many results to return in the next batch but omits to say that no more results are needed.

If you set a limit of 100, then the “batchSize“ gets set to 100 but the server assumes you may ask for more than 100 records, and keeps the cursor open on the server. In turn it makes it impossible for the server to reduce the amount of records sorted: it must sort in-memory *all* your friends’ documents *since the beginning of times*.

The fix is to use a negative limit: if set to -100 then it will be passed to the server as a negative “batchSize“ value. It tells the server: send me only 1 batch with a maximum of 100 documents and do not keep the cursor open. One caveat to this solution is that a batch cannot exceed 16MB, so the 100 documents must fit within 16MB (which is likely).

The Problem with MongoDB 2.0

With MongoDB 2.0 (which hopefully you’re not using anymore at this time) the behavior still won’t be the one expected. It will just continue to disregard the limit and always resort to sort all documents since day 1. As a consequence the queries will get slower and slower over time, which is a big no-no.

The following diagram shows the bad behavior.

image

The Limited Fix with MongoDB 2.2 and 2.4

A limited fix was implemented in 2.2. The following query will work as expected and return within milliseconds:

db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(-100)

So you may think that it is all good … unfortunately a big limitation is that any extra predicate may turn off the optimization. Any extra predicate must be an equality (or a “$in“) on a field that is part of the index to the left side of the field sorted on (“date“ in this case). If you want to only see posts that have specific properties (e.g. “private: False“) you must be very careful how you build the index:

db.posts.find({ userId: { $in: [12, 24, 56, …], private: False } }).sort({ date: -1 }).limit(-100)

It must use an index on “{ userId: 1, private: 1, date: 1 }“. A different type of predicate, or against a field not covered by the correct portion of the index will put you back in the bad slow case.

This issue will be fixed as part of the 2.6 index internal API refactoring, which you should look forward to!

A Solution: Add a Predicate on Sorted Field

One solution is to also specify a date range, which will properly limit how much data gets sorted.

For example:

db.posts.find({ userId: { $in: [12, 24, 56, …] }, date: { $gt: x, $lt: y } }).sort({ date: -1 }).limit(-100)

The obvious problem is that you don’t know what date range will yield enough / too much results… You can first query for the past day, then go back in time with increasing ranges if not enough data is returned.

A Solution: Multiple Queries

To properly make full use of the index, a solution is to split the query into many queries, one per friend id. In that case MongoDB can properly iterate through the sorted index without doing extra work.

The queries would be:

db.posts.find({ userId: 12, … } }).sort({ date: -1 }).limit(-100)

db.posts.find({ userId: 24, … } }).sort({ date: -1 }).limit(-100)

…

The client code will then do the sort of the separate results, which is fast to do. Any predicate can be added to filter the data further, and should be part of the right side of the “date“ in the index.

Unfortunately this solution only works well if the number of friends is fairly low. Each query should return within a millisecond, but with 100 friends you can be looking at several milliseconds already. The good news is that the speed will remain the same over time, but there is some extra overhead for sure. Overall it is a good solution if the number of friends under 100 on average.

A Solution: Fan Out

An alternative strategy is to duplicate the data into each of the friend’s own data. When a user posts an update, it is duplicated into 1 document for each of the friends. Consequently the problematic query becomes easy:

db.posts.find({ userId: 7 }).sort({ date: -1 }).limit(-100)

The obvious downside is the large duplication of content. If a user has 10k followers, the data will generate 10k documents written to the database.

Conclusion

In conclusion, either:

  • use the 2.2 optimization but be very careful about its implementation
  • use one of the alternative solutions above… each has pros / cons.

Expect MongoDB 2.6 to give you full intuitive support for this! One ticket to watch is SERVER-3310.

MongoDB Database

Published at DZone with permission of Antoine Girbal, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • WireMock: The Ridiculously Easy Way (For Spring Microservices)
  • How to Optimize CPU Performance Through Isolation and System Tuning
  • HashMap Performance Improvements in Java 8
  • Which Is Better for IoT: Azure RTOS or FreeRTOS?

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: