Database design has evolved greatly over the last ten years. In the past, it used to be the database analyst's job to fine-tune the SQL query and database index to ensure performance. Nowadays, developers play a much more crucial role to ensure the scalability of data.
The database design task, which was once monotonous, has now become an exciting task which requires a lot of creativity. In this short article, let's walk through an example of a real life issue to see how database designs have changed.
In this example, our business requirement is to build a database to store property information for the country. First, we need to store any property and its landlords. If a property is being leased, the system needs to store tenants information as well. The system should also record activities of properties, including buying, selling and renting.
As a typical database system, the user should be able to query the property by any information like address, owner name, district, age, etc. The system needs to serve data for both real time query and reporting purpose.
It is pretty obvious that there are a few entities here like landlords, tenants, transactions, properties. Landlord and tenants can be further analysed as people acting different roles. Moreover, one person can rent out his house and rent another house to live in, which means he can be the landlord of one property and the tenant of another. That leaves us with three major entities: Person, property and transaction. Person and property entities have many relationships to each other. Transaction entities link to a property and at least one person.
If we group some common attributes like occupation, district, and building, it is possible to introduce some other sub-entities that help to reduce redundancy in information.
The Era of the Relational Database (RDBMS)
If you are one of a developer that is trapped in the relational database era, the only viable choice for persistence is relational database. Naturally, each entity will be stored in a table. If there are relationships between two entities, they are likely to refer to each other by foreign keys.
With this setup, there is zero redundancy and every piece of information has a single source of truth. Obviously, it is the most efficient way in term of storage of storing data.
There may be an issue here as it is not easy to implement text searching. Whether it is ten years ago or today, text search has never been supported well by relational databases. SQL language provides some wildcard matching in the language itself, but it is still very far from a full text search.
Assuming that you have completed the task of defining the database schema, the fine tuning part is normally the job of database analysts; they will look into every individual query, adding view, index, and play around with the query itself to increase the performance as much as possible.
If the readers have spent years working on relational database, it is quite easy to see the limit of this approach. A typical query may involve joining several tables. While it works well for low amount of records, the solutions seem less feasible when the number of tables increase or the amount of records in each table increase. All kinds of tweaks like data sharding, scaling vertically or adding indexes only help to increase performance up to a certain level. No magic can help if we are going to deal with hundreds millions of records or joining more than ten tables.
Extension of Relational Database
To solve the issue, developers have tried several techniques that may scale better than a traditional relational database. Here are some of them:
This technique reverses the process of normalizing data when storing to a database. For example, instead of joining property with the building, district or country table, we simply copy all the columns of the relevant records to a property table. Subsequently, duplication and redundancy happen. There is no single source of truth any more for sub-entities like building, district, country. The step of joining tables is simplified.
Explosion is an expensive process that may take hours or even days to run. It sacrifices space plus freshness of data in order to increase real time query performance.
Adding a Document Database
In this technique, a relational database is the source of truth. However, to provide text search, important fields were extracted and stored in a document database. For example, knowing that users will search for people by age, gender and name, we can create document that contains this information plus the record ID and store them to Solr or Elastic Search server.
Real time query to the system will first be answered by searching in the document database. The answer, which includes bunches of record IDs, will later be used by a relational database to load records. In this case, the document database acts like an external index system that helps to provide additional features.
Storing the Whole Data to a NoSQL Database
The other choice is storing data to a NoSQL database. This approach may add a lot of complexity for data maintenance.
To visualize, we can store the whole property or person objects to database. The property object may contain owners and tenants as objects. In reverse, the owner object may include several property objects. In this case, it is quite a hassle to maintain to set of related documents if the data change.
For example, if a person purchases a property, we need to go to the property document to update owner information and go to that person's document to update property information.
Combining Relational Database and NoSQL Database
The Limits of Existing Methods
After scanning through the approaches mentioned above, let's try to find the limit for each approach.
- Relational database normalizes data before storing to avoid duplication and redundancy. However, by optimizing storage, it causes additional effort on retrieving the data. Taking consideration that database is normally limit by querying time, not storage, it doesn't seem to be a good trade off.
- Explosion reverses the normalizing process but it cannot offer fresh data, as explosion normally takse a long time to run. Comparing running explosion with storing the whole entity object to an object-oriented database, it may be easier to maintain for the latter option.
- Adding a document database offers the text search but I feel that it should reverse the options for scalability. Document database is faster for retrieval while relational database is better for describing relationship. Why should we send the record IDs from the document database back to the relational database for retrieving records? What may happen if there are millions of record IDs to be found? Retrieving those records from NoSQL databases is typically faster than relational databases because we do not need to go through joining process.
- As mentioned above, when the entities are interlinked, there is no easy way to separate them out to store to an object-oriented database.
Proposing Combination of Relational and NoSQL Databases to Store Data
Thinking about these limits, I feel that the best way to store data may be combining both relational and an object or document database. Both of them will act as a source of truth, storing what they do best. Here is the explanation of how should we split the data.
We store the data similarly to a traditional relational database but splitting the columns to 2 types of columns:
Columns that store id or foreign keys to other entity ids ("property_id", "owner_id",..) or unique fields
Columns that store data ("name", "age",...)
Remove any columns that store data from database schema. It is possible to keep some simple fields like "name" or "gender" if they help to give us some clues by looking at records. After that, storing the full entities to a document database. Avoid making cross-references in documents.
Explain the Approach by Example
Let's try to visualize the approach by describing how should we implements some sample tasks
- Storing a new property owned by a user
- Configure JPA to only store name and id for each main entity like person, property. Ignore all data fields or sub-entities like building, district, country.
- Storing this property to relational database, retrieving back stored object with persisted id.
- Storing property with updated id to document database, setup the text index for searchable field.
- Storing owner to document database but removing the link to property.
- Querying property directly
- Sending query to document database, retrieving back record.
- Querying property based on owner information
- Sending query to relational database to find all property that belong to the owner (assume that owner id is known. If not, the pre-step is to query document database to find owner first).
- Sending query to document database to find these property by ids.
In the above steps, we want to store records to a relational database first because of the auto ID generation. After the first step, we have a very thin relational database that only captures relationships among entities rather than the entities themselves.
To avoid cross-reference, we make a choice of including owner information in property object rather than property information in owner object. It is a practical choice that depends on predicting future queries. With this setting, querying property will be fast because it only requires single query for the noSQL database.
Summary of the Approach
Finally, let's summarize the new approach:
- Treating main entities as independent records.
- Treating sub-entities as complex properties.
- Storing main entities to noSQL database.
- Storing ID, name and foreign keys of main entities inside relational database. The relational database is serving as a bridge, linking independent objects in the NoSQL database.
- Any CRUD will always require committing to two databases at the same time.
- Store new object to relational database first before because of ID generator.
- Offload the storing data task from relational database but let it do what it can do best, store relationships.
- Free to choose any scalable database that offers text search for storing entities.
- Best of both worlds with text search and scalability of NoSQL database and RDBMS' searching of relational database.
- Maintaining two databases.
- No single source of truth. Any corruption happening in one of the two databases will cause data loss.
- Code complexity.
- Storing data to a graph database that offers text search capability. This is quite promising as well but I have not done any benchmarks to prove feasibility.
The solutions is pretty complex but I found it is interesting that the scalability issue is solved at the code level rather than the database level. By splitting the data out, we may tackle the root cause of the issue and be able to find some balance between performance and maintenance effort.
The complexity of the implementation is very high but there is no simple implementation for big data.