Before studying the features of the various NoSQL systems, it is useful to review the history behind NoSQL storage systems.
NoSQL as antithesis
Here is a little mnemonic device. If NoSQL is No SQL, it must be an anti-RDBMS system. The logical flow explaining why NoSQL is the opposite of RDBMS is as follows:
An RDBMS models data relationally (as relations, or tables), defines SQL as the language for manipulating that data, and supports transactions with the ACID properties.
The scale-out approach, which distributes data across machines before processing it, is in high demand because Internet-scale services require high data throughput.
When an existing RDBMS scales out, it aims to preserve the integrity of relational-model operations, which are the core of an RDBMS, and of transactions.
It is difficult to scale out while maintaining that integrity, and the same problem arises when distributing or replicating data. So we relax the ACID transaction guarantees that the DBMS ensures, or the replication integrity model, in order to scale out.
Scale-out of RDBMS
Scaling out is difficult in an RDBMS. If scaling out to several thousand RDBMS instances were easy, prominent database vendors such as Oracle would have released a product or two for that a long time ago.
Now let us assume that the tables of an RDBMS are distributed across several computers, and that each piece of data is replicated when it is stored, for high availability. First, executing distributed transactions while satisfying ACID is difficult in a scaled-out system.
To satisfy the atomicity property of ACID, a distributed transaction protocol such as two-phase commit (2PC) must be run on every system that takes part in a given transaction.
To satisfy the isolation property of ACID, data must generally be locked. The unit of locking can be a record, a table, or an index.
Therefore, to satisfy atomicity and isolation in a distributed environment, all related locks must be held on each system while the distributed transaction protocol runs; the higher the service load on the system, the heavier the lock contention becomes. This is what makes scaling out difficult.
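The interaction between 2PC and locking can be illustrated with a minimal in-memory sketch. It is not how any particular DBMS implements the protocol; the Participant class, the two_phase_commit function, and the transaction ids are hypothetical names chosen for this example, and failure handling (timeouts, coordinator crashes, recovery logs) is deliberately left out.

```python
class Participant:
    """One database node holding part of the data (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}      # committed key -> value
        self.locks = {}     # key -> id of the transaction holding the lock
        self.pending = {}   # txid -> writes staged during the prepare phase

    def prepare(self, txid, writes):
        """Phase 1: lock every key and stage the writes, then vote."""
        if any(self.locks.get(key, txid) != txid for key in writes):
            return False    # a key is locked by another transaction: vote no
        for key in writes:
            self.locks[key] = txid
        self.pending[txid] = dict(writes)
        return True

    def commit(self, txid):
        """Phase 2 (commit): apply the staged writes, then release locks."""
        self.data.update(self.pending.pop(txid, {}))
        self._release(txid)

    def abort(self, txid):
        """Phase 2 (abort): discard the staged writes and release locks."""
        self.pending.pop(txid, None)
        self._release(txid)

    def _release(self, txid):
        self.locks = {k: t for k, t in self.locks.items() if t != txid}


def two_phase_commit(txid, writes_by_node):
    """Commit only if every participant votes yes; otherwise abort everywhere."""
    votes = [node.prepare(txid, writes) for node, writes in writes_by_node]
    decision = all(votes)
    for node, _ in writes_by_node:
        if decision:
            node.commit(txid)
        else:
            node.abort(txid)
    return decision


a, b = Participant("a"), Participant("b")
first = two_phase_commit("t1", [(a, {"x": 1}), (b, {"y": 2})])

# Transaction t2 has prepared on node a and still holds the lock on "x",
# so t3 cannot get a yes vote from node a and is aborted on both nodes.
a.prepare("t2", {"x": 10})
second = two_phase_commit("t3", [(a, {"x": 99}), (b, {"y": 99})])
print(first, second)
```

In this sketch every lock stays held from the prepare phase until the commit or abort decision reaches the node, so a busy key blocks every other transaction that touches it, on every node involved.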
Another problem is that scaling out by replicating and distributing data has its limits.
Transactional replication based on 2PC has the problem that a transaction fails, and the system becomes unavailable, when any one of the systems involved in the replication fails. In addition, performance degrades as more systems take part in the replication.
As an alternative, the write-ahead log (WAL) data of a DBMS can be passed to the replica system, which then applies it. If we call the system where the changes originate the master (or primary), and the system to which the changes are applied the slave (or backup), the systems are configured as either master-slave or multi-master.
When configuring master-slave: This is the most commonly used replication method. With this method, the speed of the process is inversely proportional to the number of systems involved in replication.
When configuring multi-master: When there are several masters, it is difficult to resolve or prevent conflicts between data writes. Jim Gray, who received the Turing Award in 1998 for his contributions to database and transaction processing research, studied this issue in The Dangers of Replication and a Solution.
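The difference between the two configurations can be seen in a small sketch of log-based replication. The Node class and its methods are hypothetical and reduce the WAL to a plain list of (key, value) changes; a real WAL records far more, such as transaction boundaries and physical page changes.

```python
class Node:
    """A database node with a simplified write-ahead log (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.log = []   # ordered (key, value) changes accepted from clients

    def write(self, key, value):
        """Accept a client write: append it to the log, then apply it."""
        self.log.append((key, value))
        self.data[key] = value

    def apply_log(self, entries):
        """Replay log entries shipped from another node."""
        for key, value in entries:
            self.data[key] = value


# Master-slave: only the master accepts writes, so replaying its log
# in order brings the slave to the same state.
master, slave = Node(), Node()
master.write("x", 1)
master.write("x", 2)
slave.apply_log(master.log)
print(slave.data == master.data)

# Multi-master: both masters accept a write to the same key before
# they exchange logs, and each one ends up with the other's value.
m1, m2 = Node(), Node()
m1.write("x", "from-m1")
m2.write("x", "from-m2")
log1, log2 = list(m1.log), list(m2.log)
m1.apply_log(log2)
m2.apply_log(log1)
print(m1.data, m2.data)
```

With a single master there is one authoritative order of changes. With two masters the sketch has no rule for ordering concurrent writes, so the replicas diverge; this is the conflict that a multi-master setup must either prevent or resolve.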
Sharding by developers
Generally speaking, it is extremely difficult to scale out while satisfying the ACID properties in the DBMS data model. For this reason, to scale out on top of a DBMS, one must simplify the data model itself, partition the data into N pieces, and then execute each query within a single piece of data.
The unit of partitioned data is called a 'shard'. The N shards are distributed across, and served by, M DBMS instances. The DBMS does not manage the shards; that is the responsibility of the service developer.
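A minimal sketch of such application-side routing is shown below. The shard count, the instance names, and the choice of a CRC32 hash are assumptions made for this example; the original text does not prescribe how keys are assigned to shards.

```python
import zlib

NUM_SHARDS = 8                          # N: number of shards (assumed)
DB_INSTANCES = ["db0", "db1", "db2"]    # M: DBMS instances (hypothetical names)


def shard_of(key):
    """Map a record key to one of the N shards with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# The shard-to-instance mapping lives in the application, not in the DBMS.
shard_map = {
    shard: DB_INSTANCES[shard % len(DB_INSTANCES)]
    for shard in range(NUM_SHARDS)
}


def instance_for(key):
    """Return the DBMS instance that serves the shard holding this key."""
    return shard_map[shard_of(key)]


print(instance_for("user:42"))
```

Every query for a given key is sent to the instance returned by instance_for, which is why each query has to stay within a single shard.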
Because sharding depends on the developer, it has the following difficulties:
First, a shard must be defined.
The basic storage unit of a DBMS is the table. Because a table can contain one or more shards, the application needs to know which shard is mapped to which database instance, and where the shard tables are located.
Each shard has different throughput requirements and a different data size. As a result, a developer must add or remove database instances and redistribute the shards manually. This is a painstaking and labor-intensive process.
The mapping information that has been modified by the distribution or redistribution process must be applied to the application.
Management tasks, such as configuring replication, are necessary when data is modified.
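The redistribution step can be sketched as follows. The add_instance function and the balancing rule (move shards from the busiest instance until shard counts differ by at most one) are assumptions made for this example. It also counts shards only, whereas, as noted above, real shards differ in size and throughput.

```python
from collections import Counter


def add_instance(shard_map, new_instance):
    """Return a new shard map in which new_instance takes over shards
    from the most loaded instances, plus the list of required moves."""
    new_map = dict(shard_map)
    load = Counter(new_map.values())    # instance -> number of shards
    load[new_instance] = 0
    moves = []
    while True:
        busiest = max(load, key=load.get)
        if load[busiest] - load[new_instance] <= 1:
            break                       # balanced to within one shard
        shard = min(s for s, inst in new_map.items() if inst == busiest)
        new_map[shard] = new_instance
        load[busiest] -= 1
        load[new_instance] += 1
        moves.append((shard, busiest, new_instance))
    return new_map, moves


old_map = {0: "db0", 1: "db1", 2: "db0", 3: "db1", 4: "db0", 5: "db1"}
new_map, moves = add_instance(old_map, "db2")
for shard, source, target in moves:
    print(f"move shard {shard}: {source} -> {target}")
```

Each entry in moves stands for data that has to be copied between instances by hand or by custom tooling, and the application must switch to new_map once the copy is complete.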