Design Decisions: The Data Model and MongoDB
We had to change the data model from a collection of environments to a representation of groups of environments. Here's how we did it!
This week, we had a dilemma at BigPanda about how to model, in MongoDB, a new data type we were asked to add to our system.
We do smart aggregation of alerts from IT monitoring systems to help NOC teams. Our main screen contains a feed of incidents, and because an organization can have a large number of incidents (even after de-duplication and correlation), we have a feature called environments. Users can sort incidents into separate environments according to source systems, clusters, and so on.
The new requirement we’ve received is to create another level of hierarchy so that we have groups of environments. For example, one group of environments can belong to a specific team, while another group of environments can be the responsibility of a different team.
We had to change the data model from a collection of environments to a representation of groups of environments. Before I elaborate on our constraints, let’s review Mongo’s data modeling concepts. If you’re already familiar with this, feel free to skip this part.
Embedded Data Models (Denormalized)
The first concept we'll introduce is embedded data models, where a document contains inner objects. These work well for both one-to-one and one-to-many relationships.
Embedded data model (screenshot from MongoDB’s documentation)
The advantage of embedded data models is that they allow fewer queries and updates, as well as better performance for read operations. The disadvantage is that large documents can cause performance issues for writes.
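As a rough illustration, an embedded document can be pictured as a nested structure that a single read returns whole. The field names below (name, group, sources) are hypothetical and not taken from our schema, and a plain Python dict stands in for a MongoDB document.

```python
# Hypothetical environment document with related data embedded as inner objects.
environment = {
    "_id": "env-1",
    "name": "production-db",
    "group": {"name": "Team A"},      # embedded one-to-one relationship
    "sources": [                      # embedded one-to-many relationship
        {"system": "nagios"},
        {"system": "zabbix"},
    ],
}


def group_name_of(env):
    """Read the embedded group name; no second lookup is needed."""
    return env["group"]["name"]


print(group_name_of(environment))
```

Everything the reader needs arrives with the one document, which is where the read advantage comes from.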
Normalized Data Models
This is similar to the classic relational concept: in one document, there’s a link to another document’s ID (from a different collection).
Normalized data model (screenshot from MongoDB’s documentation)
Mongo has no relational-style joins in regular queries (the aggregation framework's $lookup stage is the exception), so reading this model normally takes two separate calls. The link is made either by a manual reference to the _id field or by a DBRef, an object combining the _id, the collection name, and optionally the database name. Manual references are recommended for most cases.
Normalized data models are a good fit when embedding would duplicate data without providing enough of a read-performance advantage to justify the duplication. They are also good for complex many-to-many relationships and large hierarchical data sets.
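A minimal sketch of a manual reference, assuming a hypothetical group_id field on the environment. Dicts keyed by _id stand in for the two collections, and find_by_id is a stand-in helper, not a driver call.

```python
# Two hypothetical collections; each dict maps _id to a document.
groups = {
    "grp-1": {"_id": "grp-1", "name": "Team A"},
}
environments = {
    "env-1": {"_id": "env-1", "name": "production-db", "group_id": "grp-1"},
}


def find_by_id(collection, doc_id):
    """Stand-in for one find-by-_id round trip to the database."""
    return collection.get(doc_id)


def group_of_environment(env_id):
    """Resolve the reference by hand: one call per collection."""
    env = find_by_id(environments, env_id)       # first call
    if env is None:
        return None
    return find_by_id(groups, env["group_id"])   # second call


print(group_of_environment("env-1")["name"])
```

The group's name lives in exactly one place, at the cost of the second call.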
As mentioned, we have a collection of environments and we need to group them. I've found that it's helpful to map the operations we'll need to perform on the new data. Here they are:
- Get the list of all groups
- Get the list of environments, by group
- Create, update, and delete groups
- Have a default group, and if a group is deleted, all environments move to the default group
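The operations above can be sketched against an in-memory stand-in. This is only an illustration under assumptions: the shape (each group holding a list of environment IDs), the function names, the "default" ID, and the rule that the default group itself cannot be deleted are all hypothetical.

```python
DEFAULT_GROUP_ID = "default"  # hypothetical ID for the default group

# Assumed shape: each group lists the IDs of its environments.
groups = {
    DEFAULT_GROUP_ID: {"name": "Default", "environment_ids": ["env-3"]},
    "grp-1": {"name": "Team A", "environment_ids": ["env-1", "env-2"]},
}


def list_groups():
    """Get the list of all group IDs."""
    return sorted(groups)


def environments_in(group_id):
    """Get the environment IDs that belong to one group."""
    return list(groups[group_id]["environment_ids"])


def create_group(group_id, name):
    if group_id in groups:
        raise ValueError(f"group {group_id!r} already exists")
    groups[group_id] = {"name": name, "environment_ids": []}


def rename_group(group_id, name):
    groups[group_id]["name"] = name


def delete_group(group_id):
    """Delete a group and move its environments to the default group."""
    if group_id == DEFAULT_GROUP_ID:
        raise ValueError("the default group cannot be deleted")  # assumed rule
    removed = groups.pop(group_id)
    groups[DEFAULT_GROUP_ID]["environment_ids"].extend(removed["environment_ids"])


delete_group("grp-1")
print(environments_in(DEFAULT_GROUP_ID))
```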
I also mapped the expected read/write load. In this case, we'll have few writes and a large number of reads, but the reads are cached, and since neither the environments nor the groups change very often, the cache leaves very few reads that actually reach the database.
The last thing I mapped is the expected size of the data: for now, the new group will have only one property besides its ID, which is its name.
These are the three options we had:
And below are our considerations.
Updating Environment’s Group
- In Option 2, updating an environment's group takes two DB calls: one to remove the environment from the old list and a second to push it onto the new list. In Options 1 and 3, the update takes only one DB call.
- In Option 2, we need to keep the previous state so that we know which list to remove the environment from. In Options 1 and 3, we don't need the previous state because we can simply overwrite the existing group without knowing what it is.
- In Options 1 and 3, updating the environment and its group can be done in one call to the server, while Option 2 forces us to make additional calls.
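A minimal sketch of that difference, assuming (from the considerations above) that in Options 1 and 3 the environment records its own group, while in Option 2 each group keeps a list of environment IDs. All names are hypothetical, dicts stand in for collections, and the returned numbers simply count the simulated DB calls.

```python
# Shape assumed for Options 1 and 3: the environment records its own group.
env_docs = {"env-1": {"name": "production-db", "group": "grp-1"}}


def move_env_by_field(env_id, new_group):
    """One write: overwrite the group, whatever it was before."""
    env_docs[env_id]["group"] = new_group
    return 1  # simulated DB calls


# Shape assumed for Option 2: each group keeps a list of environment IDs.
group_docs = {
    "grp-1": {"environment_ids": ["env-1"]},
    "grp-2": {"environment_ids": []},
}


def move_env_by_lists(env_id, old_group, new_group):
    """Two writes, and the caller must already know the previous group."""
    group_docs[old_group]["environment_ids"].remove(env_id)   # call 1
    group_docs[new_group]["environment_ids"].append(env_id)   # call 2
    return 2  # simulated DB calls


print(move_env_by_field("env-1", "grp-2"))
print(move_env_by_lists("env-1", "grp-1", "grp-2"))
```

The second function needs the old_group argument, which is the previous state the list-based shape obliges us to track.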
These are not major issues, since UPDATE isn't a common action, and as long as we make the necessary calls in one transaction, we'll be fine. Still, it adds some code we could otherwise avoid.
- We thought that Option 3 might spare us from implementing a new RESTful API around groups, but we would still need to implement, for example, the delete group operation as a patch on the environments, and maybe also separate RBAC for groups. So, to keep the REST design conventional, groups will need a separate API even if they live in the same collection.
We would have chosen Option 3 if we had no other constraints, but we had to go with Option 2: Options 1 and 3 force us to change the existing schema in a way that affects another microservice of ours (one we hope to change soon), which is a bit problematic.
Option 3 is the most conventional way, but since there aren’t any performance issues and we’ll have to write some extra code in all cases, it’s not problematic to choose another option.
I hope this post helps you consider all kinds of paths when you need to model your data in the future.
Published at DZone with permission of Dafna Rosenblum. See the original article here.