Overview of Common Data-Keeping Techniques Used in a Distributed Environment
Overview of Common Data-Keeping Techniques Used in a Distributed Environment
Look at some common data-keeping techniques that are used in a distributed environment.
Join the DZone community and get the full member experience.Join For Free
This article summarizes a very high-level overview of the common data handling techniques used in distributed environments along with some of their key points and advantages.
Remember those old days of RDBMS where we used to organize the associative set of columns in the same table with a foreign key as referential entities, mostly to reduce the redundancy of data across different tables? For example, instead of putting 'employee_ name' column in employee's personal_detail table and address_detail table, we used to keep it in personal_details only, whereas 'emp_id' can be a foreign key in the address_detail table.
- Big tables split into various tables having a subset of columns to reduce the duplicate and redundant column values.
- Normalization process consists of various stages like 1NF, 2NF, 3NF, and Boyce & codd (BCNF).
- Various referential entities involved like Unique_key, Foreign_key, Super_key, etc.
- Joins are needed to fetch the data from multiple tables.
- To remove the database anomalies during insert, update, and delete operations on tables due to data redundancy.
- Searching, sorting, and creating indexes is faster, since tables are narrower, and more rows fit on a data page.
- Index searching is often faster, since indexes tend to be narrower and shorter.
- More flexibility in tuning the queries.
- Fewer null values and less redundant data, making the database more compact
Here is a good article on this topic: https://beginnersbook.com/2015/05/normalization-in-dbms/
Partitioning is a general term that refers to the database splitting (sharding) only. It can be achieved in many forms, for example, column division or row-wise division. Partitioning divides the big database containing data tables and indexes into smaller and handy slices of data called partitions. The partitioned tables are directly used by the queries without any alterations.
Once the database is partitioned, the data definition language can easily work on the smaller partitioned slices instead of handling the giant database altogether. This is how partitioning cuts down the problems in managing the large database tables.
- Partitioning advances query functionalities. Queries can be more responsive on the collection of relatively small partitions instead of the giant database, hence the functionality and performance level gains and improves.
- Parallelism, data can be fetched and processed by the application in parallel fashion which reduce the processing time.
- The manageability and maintenance of smaller partitions is relatively easy compared to a big database.
- And, of course, high availability can be achieved by replicating the partitions.
Database sharding is the process of splitting the data present in one database/table into multiple databases/tables having the same schema such that the data is divided into multiple smaller distinct blocks, known as shards. In other words, sharding is a method of splitting and storing a single logical dataset in multiple databases and often on multiple machines.
In simple words, splitting the table row-wise means distributing the table rows among multiple databases and is known as Horizontal Sharding.
This idea is very general in nature. That’s why sharding can be implemented at either the application or database level. In many databases, sharding is a first-class concept, and the database knows how to store and retrieve data within a set of databases i.e. cluster.
Nowadays, almost all modern databases natively support sharding. Cassandra and MongoDB are some the of popular distributed databases.
- Same schema is replicated in all the databases.
- Shard key is important to look up where the required data resides.
- Unique set of data is present in each shard.
- Sharding solves the scaling problem on the horizontal axis by adding more databases to hold new data.
- It's a cheaper solution as compared to data partitioning.
In general, sharding is recommended as soon as any of these conditions are met:
- Write/Read workload is too high to be handled by a single server. Many concurrent users hitting a single database.
- The working dataset or index no longer fit in memory.
- The dataset is too big to easily fit even in a single server, and maintainability is a nightmare.
Note: #1 and #2 are by far the most common reasons why people need shards in most of the traditional databases.
For sharding the data, a key is required and is known as a shard key. This shard key is either an indexed field or an indexed compound field that exists in every document in the collection.
There is no general rule to select a sharding key; what key you choose depends on your application and data. For instance, you may choose ID as the shard key in an employee database. For example, it can be dependent upon the frequent search criteria that is being used across the application, geo-locations of the databases, etc.
Sharding allows your application to make fewer queries. When it receives a request, the application knows where to route the request, and thus it has to search through less data rather than going through the whole database table.
It improves the overall performance of your application, reduces latency, and lets you rest easier, not having to worry about scalability issues.
Conceptually, Vertical Sharding is nothing but a form of data partitioning. This can be done by splitting the big schema into various small schemas and accordingly storing the different chunks of data in separate machines. Splitting a table column-wise, which means columns of single tale, are distributed among different databases. This happens considering the domain of business application. We do a logical split within the application data and store the data in different databases.
Mostly, this is achieved at the application level. The code is responsible for doing reads and writes from the designated databases.
- Schema is split logically among different databases.
- Mostly involves domain knowledge.
- Duplicate columns can be present in each database. For example, a primary key column 'Id' as a foreign key in a different database.
- It solves the scaling problem on the vertical axis by increasing more space or computation power in the same machine to hold or compute new data.
- It's a costlier solution as compared to sharding after a certain level.
- Done primarily for space and performance concerns. For example, a customer's billing address information is dumped into a separate table with a CustomerId as a reference so that in the future, the design has the flexibility to move that information into a separate database for different security concerns, etc.
- There are some blob kinds of columns that hold really big images or multimedia info that can be easily stored into a separate database, which is planned for different maintenance schedule.
- Some sort of data needs a special kind of database like time-series databases for efficient processing.
Data Replication is the process of keeping data on more than one machine which can be present on geographically different sites. It is beneficial in improving the performance and availability of data especially in a distributed environment. This technique simply copies the data from a database present on one machine to another machine so that all the users can see the same data without any inconsistency.
- Data replication encompasses transaction level duplicity so that it can replicate the information in a consistent state and synchronizes with the source database.
- Full replication is a scheme in which the whole database is stored at every site.
- Partial replication is a scheme in which some frequently used fragment of the database is replicated and others are not replicated.
- To increase the availability of data.
- To increase the throughput of the application.
- To minimize the risk of data loss or data corruption.
- To minimize the overall latency, making application more responsive.
So, this is a small effort to summarize the high-level overview of various database concepts in one place.
As always, drop me a line in the comments if I've made any mistakes or if I can be helpful in any way. Thanks!
Opinions expressed by DZone contributors are their own.