
Bulk Operations in MongoDB

By failing to take advantage of array inserts, you're essentially sending network packets that could carry hundreds of documents over with only a single document in each packet.

By Guy Harrison · Nov. 13, 17 · Tutorial


Like most database systems, MongoDB provides API calls that allow multiple documents to be inserted or retrieved in a single operation.

These “array” or “bulk” interfaces improve database performance markedly by reducing the number of round trips between the client and the database. To appreciate how fundamental this optimization is, imagine you have a group of people to take across a river. You have a boat that can carry 100 people at a time, but for some reason you take only one person across on each trip. Not smart, right? Failing to take advantage of array inserts is very similar: you are sending network packets that could carry hundreds of documents with only a single document in each packet.
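As a quick illustration, here is a minimal sketch using the Node.js driver; the collection and docs variables are assumptions (set up elsewhere, not from the original experiments), and the only point is the difference in round trips:

// One network round trip per document: N round trips for N documents
for (const doc of docs) {
    await collection.insertOne(doc);
}

// One round trip per batch: the driver packs many documents into each request
await collection.insertMany(docs);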

Optimizing Bulk Reads Using batchSize()

When retrieving data using a cursor, you can specify the number of documents fetched in each operation using the batchSize clause. For instance, in the snippet below, limit controls the total number of documents we will process, while arraySize controls the number of documents retrieved from the MongoDB database in each network request.

// limit caps the total number of documents processed;
// arraySize sets how many documents come back per network round trip
const cursor = useDb.collection('millions').find().limit(limit).batchSize(arraySize);

let counter = 0;
for (let doc = await cursor.next(); doc != null; doc = await cursor.next()) {
    counter++;
}

Note that the batchSize operator doesn't actually return an array to the program; it just controls the number of documents retrieved in each network round trip. This all happens "under the hood" from your program's point of view.

By default, MongoDB sets a pretty high value for batchSize, and you can easily degrade performance by fiddling with it. However, if you are fetching lots of small documents from a remote collection, you can get a significant improvement in throughput by upping the setting. Below we see the effect of manipulating batchSize(). Settings of batchSize below 1,000 made performance worse, sometimes much worse. However, settings above 1,000 resulted in significant performance improvements (note the logarithmic scale).

[Chart: effect of the batchSize() setting on elapsed time, logarithmic scale]
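If you want to reproduce the experiment on your own data, a rough timing harness along these lines will do it. This is a sketch only: it assumes a modern Node.js driver (where cursors are async-iterable) and an existing collection variable holding many small documents.

// Time a full collection scan at several batchSize settings
for (const size of [10, 100, 1000, 10000, 100000]) {
    const start = Date.now();
    let count = 0;
    for await (const doc of collection.find().batchSize(size)) {
        count++;
    }
    console.log(`batchSize=${size}: ${count} docs in ${Date.now() - start} ms`);
}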

Avoiding Excessive Network Round Trips in Code

batchSize() helps us reduce network overhead transparently in the MongoDB driver. But sometimes the only way to reduce network round trips is to tweak your application logic. For instance, consider this code:

// Fetch every hundredth document, one indexed lookup at a time
for (let i = 1; i < max; i++) {
    if (i % 100 === 0) {
        const cursor = useDb.collection(mycollection).find({
            _id: i
        });
        const doc = await cursor.next();
        counter++;
    }
}

We are pulling out every hundredth document from a MongoDB collection. If the collection is large, that is a lot of network round trips. In addition, each of these requests must be satisfied by an index lookup, and the combined cost of all those lookups will be high.

Alternatively, we could pull the entire collection across in one operation and then extract the documents we want.

// Pull the entire collection across in big batches and filter on the client
const cursor = useDb.collection(mycollection).find().batchSize(10000);
for (let doc = await cursor.next(); doc != null; doc = await cursor.next()) {
    if (doc._id % divisor === 0) { // e.g., divisor = 100 keeps every hundredth document
        counter++;
    }
}

Intuitively, you might think that the second approach would take much longer. After all, we are now retrieving 100 times more documents from MongoDB, right? But because the cursor pulls thousands of documents across in each batch (under the hood), the second approach is actually far less network intensive. For a one-million-document collection, the first approach issues 10,000 individual requests, while the second needs only about 100 round trips at a batch size of 10,000. If the database is located across a slow network, the second approach will be much faster.

Below we see the performance of the two approaches for a local server (on my laptop) versus a remote (Atlas) server. When the data was on my laptop, the first approach was a little faster. But when the server was remote, pulling all the data across in a single operation was far faster.

[Chart: fetch strategy comparison, elapsed time for individual fetches versus a single batched scan, local and remote servers]

Bulk Inserts

Just as we want to pull data out of MongoDB in batches, we also want to insert in batches, at least when we have lots of data to insert. The code for a batch insert is a bit more complicated than the find() example. Here is an example, in mongo shell syntax, of inserting data in batches.

// Create the bulk object in ordered or unordered mode
if (orderedFlag == 1)
    bulk = db.bulkTest.initializeOrderedBulkOp();
else
    bulk = db.bulkTest.initializeUnorderedBulkOp();

for (i = 1; i <= NumberOfDocuments; i++) {
    // Add a document to the bulk batch (zz is assumed to be
    // defined elsewhere, e.g. as a padding string)
    var doc = {
        _id: i,
        i: i,
        zz: zz
    };
    bulk.insert(doc);
    // Execute the batch once batchSize documents have accumulated,
    // then reinitialize the bulk object for the next batch
    if (i % batchSize == 0) {
        bulk.execute();
        if (orderedFlag == 1)
            bulk = db.bulkTest.initializeOrderedBulkOp();
        else
            bulk = db.bulkTest.initializeUnorderedBulkOp();
    }
}
// Flush any remaining documents
bulk.execute();

We start by initializing a bulk object for the bulkTest collection. There are two ways to do this: ordered or unordered. Ordered guarantees that the documents are inserted in the order they are presented to the bulk object. Otherwise, MongoDB can optimize the inserts into multiple streams, which may not insert in order.

Inside the loop, we add documents to the bulk object. Each time we reach the batch size, we execute the batch and reinitialize the bulk object. A final execute() after the loop ensures any remaining documents are inserted.
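For comparison, here is a minimal sketch of the same pattern using the Node.js driver's bulkWrite() call rather than the shell's bulk API. The collection, NumberOfDocuments, and batchSize names are assumptions (defined elsewhere), and ordered: false corresponds to the unordered case above:

// Accumulate insert operations and flush them whenever a batch fills up
let ops = [];
for (let i = 1; i <= NumberOfDocuments; i++) {
    ops.push({ insertOne: { document: { _id: i, i: i } } });
    if (ops.length === batchSize) {
        await collection.bulkWrite(ops, { ordered: false });
        ops = [];
    }
}
// Flush the final partial batch
if (ops.length > 0) {
    await collection.bulkWrite(ops, { ordered: false });
}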

I inserted 100,000 documents into a collection on my laptop, using various batch sizes (that is, the number of documents inserted between execute() calls). I tried both ordered and unordered bulk operations. The results are charted below:

[Chart: elapsed time to insert 100,000 documents at various batch sizes, ordered and unordered]

The results are pretty clear: inserting in batches improves performance dramatically. Initially, every increase in batch size reduces elapsed time, but eventually the improvement levels off. I believe MongoDB transparently limits batches to 1,000 documents per operation anyway, but even before that point, the chances are your network packets will be full and you won't see any further benefit from increasing the batch size. To use the analogy from the beginning of this post, the rowboat is full!

Summary

A lot of the time in MongoDB we perform single-document operations. But just as often, we deal with batches of documents. The coding techniques outlined in this post can result in huge performance improvements in those scenarios.


Published at DZone with permission of Guy Harrison, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
