4 Tips to Optimize MongoDB
MongoDB performance comes from good concepts, organization, and data distribution. We are going to list some best practices for good MongoDB optimization.
In this article, we'll look at four ways to quickly optimize MongoDB.
Have you ever had performance issues with your MongoDB database? A common situation is a sudden performance issue when running a query. The obvious first solution is, "Let's create an index!" While this works in some cases, there are other options we need to consider when trying to optimize MongoDB.
Performance is not a matter of having big machines with very expensive disks and gigabit networks. In fact, these are not necessarily the keys to good performance.
MongoDB performance comes from good concepts, organization, and data distribution. We are going to list some best practices for good MongoDB optimization. This is not an exhaustive or complete guide, as there are many variables. But this is a good start.
1. Keep Documents Simple
MongoDB is a schema-free database: by default, there is no predefined schema. Newer versions let us add one, but it is not mandatory. Be aware of the difficulties involved in working with embedded documents and arrays, as parsing your data on the application side or in an ETL process can become really complicated. Besides, arrays can hurt replication performance: for every change in the array, all the array values are replicated!
In MMAPv1, choosing the right field names is really important because the database stores the field name inside every document. This is unlike a relational database, where the schema is saved once. Imagine how much a 29-character field name such as lastmessagereceivedfromsensor costs you if you have a million documents: around 28 MB just to save this field name! Documents with ten fields of similar length would need around 280 MB for the names alone, before storing any values.
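The arithmetic behind that estimate can be sketched in a few lines of Python. This is a rough illustration, not a measurement: it counts only the characters of each field name (BSON also stores a type byte and a terminating null byte per field), and the abbreviation `lmrs` is a hypothetical shorter name chosen for comparison.

```python
def field_name_overhead_bytes(field_names, document_count):
    """Bytes spent on field names when every document repeats them."""
    per_document = sum(len(name.encode("utf-8")) for name in field_names)
    return per_document * document_count


long_name = "lastmessagereceivedfromsensor"
short_name = "lmrs"  # hypothetical abbreviation, not from the article
docs = 1_000_000

long_cost = field_name_overhead_bytes([long_name], docs)
short_cost = field_name_overhead_bytes([short_name], docs)

# Report both costs in MiB (1 MiB = 2**20 bytes).
print(f"{long_name}: {long_cost / 2**20:.1f} MiB")
print(f"{short_name}: {short_cost / 2**20:.1f} MiB")
```

Shorter names save space, but they also make documents harder to read, so the trade-off is worth weighing per collection.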
Documents approaching the 16 MB maximum document size aren't desirable, as the database will need a lot of pages to work on one single document. This demands more CPU cycles to finish any operation.
2. Hardware Is Important, But...
Using good hardware with several processors and a considerable amount of memory definitely helps performance.
WiredTiger takes advantage of multiple processors to deliver good performance. This storage engine features a per-document locking algorithm so that as many processors and as many operations as possible can run at the same time (there is a ticket limitation, but this is out of this article's scope). The MMAPv1 storage engine, however, locks per collection and sometimes cannot take advantage of multiple processors for writes.
But what could happen in an environment with three big machines (32 CPUs, 128 GB RAM, and a 2 TB disk each) when one instance dies? The answer is that it will fail over, and the drivers are smart enough to read from the healthy instances and write to the new primary. However, your performance will not be the same, because the remaining instances have to absorb the load of the one that died.
It is not a universal rule, but having multiple small or medium machines in a distributed environment can ensure that an outage affects only a small part of the sharded cluster, with little or no impact perceived by the application. At the same time, more machines mean a higher probability that one of them fails. Consider this trade-off when designing your environment. The right choices affect performance.
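To make the trade-off concrete, here is a minimal sketch under loudly simplified assumptions: machines fail independently, each with the same probability, and each carries an equal share of the load. The 2% failure probability and the cluster sizes are illustrative values, not figures from a real deployment.

```python
def probability_any_failure(machine_count, p_single):
    """Chance that at least one of the machines fails (independent failures)."""
    return 1 - (1 - p_single) ** machine_count


def capacity_lost_per_failure(machine_count):
    """Share of total capacity lost when a single machine goes down."""
    return 1 / machine_count


P_FAIL = 0.02  # assumed probability that one machine fails in a given period

for n in (3, 12):
    risk = probability_any_failure(n, P_FAIL)
    loss = capacity_lost_per_failure(n)
    print(f"{n} machines: P(at least one failure) = {risk:.1%}, "
          f"capacity lost per failure = {loss:.1%}")
```

More machines raise the chance of seeing some failure, while each individual failure takes away a smaller share of capacity.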
3. Read Preference and writeConcern
The read preference and writeConcern vary according to a company's requirements. But please keep the defaults in mind: the default read preference is "primary", and recent MongoDB versions (5.0 and later) default to writeConcern: "majority".
With "majority", a write must be acknowledged by at least floor(N/2) + 1 instances, where N is the number of voting instances in the replica set. This can be slow. However, it is a fair trade-off of speed for consistency.
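As a quick check of the majority arithmetic, a majority is more than half of the members, that is floor(N/2) + 1. The sketch below assumes every instance in the replica set is a voting, data-bearing member; the function name is illustrative.

```python
def majority(member_count):
    """Smallest number of members that is more than half of the replica set."""
    if member_count < 1:
        raise ValueError("a replica set needs at least one member")
    return member_count // 2 + 1


for n in (1, 3, 4, 5, 7):
    print(f"{n} members -> majority of {majority(n)}")
```

Note that a four-member set needs three acknowledgements, the same as a five-member set, which is one reason odd member counts are common.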
Please make sure you're using the most appropriate read preference and write concern for your company. By default, drivers read from the primary, but if that is not a requirement for your environment, consider distributing the queries among the other instances. If you don't, the secondaries exist only for failover and won't be used in regular operations.
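One common place to set these options is the connection string. The sketch below only builds the string with the standard library, so it runs without a database. The host names and the replica set name `rs0` are placeholders, and the helper function is hypothetical; `replicaSet`, `readPreference`, `w`, and `wtimeoutMS` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode


def build_mongodb_uri(hosts, replica_set,
                      read_preference="secondaryPreferred",
                      write_concern="majority",
                      wtimeout_ms=5000):
    """Build a replica set connection string with explicit read/write options."""
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,  # where reads are routed
        "w": write_concern,                 # acknowledgements required per write
        "wtimeoutMS": wtimeout_ms,          # give up waiting after this long
    })
    return f"mongodb://{','.join(hosts)}/?{options}"


# Placeholder hosts; replace them with your own replica set members.
uri = build_mongodb_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    replica_set="rs0",
)
print(uri)
```

Keep in mind that reads from secondaries can return slightly stale data, so this choice suits queries that tolerate replication lag.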
4. Working Set
How big is the working set? Usually, an application doesn't use all the data. Some data is updated often, while other data isn't.
Does your working set fit in RAM? Optimal performance occurs when the entire working set is in RAM. When it is not, the database has to fetch pages from disk, and the resulting slowness, such as page faults, can hurt performance depending on the storage you're using.
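A rough way to reason about this for WiredTiger is to compare an estimate of the working set with the default internal cache size, which MongoDB 3.4 and later document as the larger of 50% of (RAM minus 1 GB) and 256 MB. The sketch below assumes that default is unchanged; the working set sizes are made-up figures, and the check is conservative because the operating system's filesystem cache also holds data.

```python
GIB = 2**30
MIB = 2**20


def default_wiredtiger_cache_bytes(ram_bytes):
    """Default WiredTiger cache: the larger of 50% of (RAM - 1 GiB) and 256 MiB."""
    return max(int(0.5 * (ram_bytes - GIB)), 256 * MIB)


def fits_in_cache(working_set_bytes, ram_bytes):
    """True when the estimated working set fits in the default cache."""
    return working_set_bytes <= default_wiredtiger_cache_bytes(ram_bytes)


ram = 128 * GIB
for working_set_gib in (40, 100):  # hypothetical working set estimates
    ok = fits_in_cache(working_set_gib * GIB, ram)
    print(f"{working_set_gib} GiB working set on 128 GiB RAM: "
          f"{'fits' if ok else 'does not fit'}")
```

Estimating the working set itself (frequently accessed documents plus the indexes they use) is the hard part and depends on the application's access pattern.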
Heavy reads from primaries, such as backups, ETL, or reporting, can really hurt performance because they compete with regular traffic for pages in the cache. The same is true for large reports or aggregations.
Having multiple collections for multiple purposes and using specific machines for specific purposes, such as using zones to store documents that will no longer be used, will help keep the working set simple and predictable.
Published at DZone with permission of Adamo Tonete , DZone MVB. See the original article here.