The Definitive Guide to the Modern Database
Data processing is ubiquitous (this should surprise exactly no one). From money transfers to phone contacts stored in the cloud, the average consumer benefits from data processing daily, often without realizing it. Meanwhile, software and hardware options for data applications are multiplying with no end in sight. Yet many of these solutions leave the data "as is," still waiting to be processed and stored efficiently. The modern world is indeed data-driven.
Enter the database management system (DBMS). According to TechTerms, a DBMS is "a software system that uses a standard method of cataloging, retrieving, and running queries on data." Established in the 1970s, the DBMS is a widely known concept and a well-researched area. Still, there are no easy answers or "silver bullets," but many open questions instead. This creates an interesting dynamic: there are more than 150 special-purpose DBMS vertical solutions available today. It's overwhelming, to say the least.
Perhaps one way to make the database world less overwhelming is to view it in terms of dilemmas, or trade-offs. No single DBMS excels in every single area; each is created based on compromises between different extremes. To gain a comprehensive view of the modern database environment, we propose several trade-offs to consider:

- Human-generated vs. Others
- Structured vs. Unstructured
- Transactions vs. Analytics
- In-memory vs. Disk-based
- Hardware vs. Software
- Scale-in vs. Scale-out
- Consistent vs. Inconsistent
- SQL vs. NoSQL
- "Classical" vs. "Modern"
- Data manipulation vs. Business logic
Trade-off #1: Human-Generated vs. Others
The rise of the Internet has created exponential growth of digital data. Each day, new bits of data are created based on substantial sets of previously available information. Building blocks upon building blocks of data. In other words, knowledge acquisition processes are self-accelerated by nature. With so much data available, it’s hard to even comprehend the amount. According to Google CEO Eric Schmidt, “Between the birth of the world and 2003, there were five exabytes of information created. We now create five exabytes every two days.”
Of all this data, we focus on human-generated data, such as emails, videos, online purchases, or bank transactions. We exclude scientific data streams from consideration (such as images from telescopes or measurements from the Large Hadron Collider), because such sources can produce practically unlimited volumes of information once the measured object is defined, and they call for unique solutions tailored to storing and processing numeric streams. A recent real-world example is MatrixNet, a machine learning tool used by nuclear physicists at the European Organization for Nuclear Research (CERN). Unlike scientific datasets, which are typically tackled with machine learning, human-generated data carries meaning that maps directly onto the DBMS parameters critical for everyday applications.
Figure: conceptual growth rate of human-generated transactions vs. other human-produced objects such as texts and images. Both grow exponentially, but transaction volumes are orders of magnitude lower.
Trade-off #2: Structured vs. Unstructured
The concept of structuredness becomes crucial when discussing human-generated data. To differentiate structured data from unstructured, consider a person's name, like "John." If we treat "John" merely as a sequence of four letters, or one word of a natural language, then this is an unstructured point of view. The main source of such data today is the human-written part of the Internet (i.e., most parts of most web pages). In contrast, if it is known that this sequence of letters is the value of a field "Name" in a particular CRM, then the data is structured, because its "meaning" in a certain context is defined. Another good example of a structured set is any business-critical transactional database, because the contents of a transaction are always defined.
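The distinction can be made concrete with a short sketch. The example below (illustrative only; the CRM record and field names are hypothetical) shows the same string "John" first as an anonymous token in free text, then as the value of a known field in a defined schema:

```python
# Unstructured: "John" is just a token in free text; its meaning is unknown.
raw_text = "John ordered two books yesterday."
tokens = raw_text.split()  # ['John', 'ordered', ...], no semantics attached

# Structured: the same string is the value of a known field in a defined
# schema, so a program can query it by meaning, not by position.
crm_record = {"name": "John", "orders": 2, "last_order": "2016-05-01"}

print(tokens[0])           # 'John' as a word, nothing more
print(crm_record["name"])  # 'John' as the value of the field "name"
```

The difference is not in the bytes but in the context: the CRM record carries a schema that assigns meaning to each value.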
Observing real-world structured datasets reveals a counterintuitive fact: their sizes are fairly moderate in practice. Even if the growth of a particular set is exponential, the exponent is still orders of magnitude smaller than the exabyte rates of the unstructured Internet. In other words, structured datasets are much smaller in scale than unstructured ones. As an example, a recent estimate put a year of Amazon transactions at 56 GB in total.
Trade-off #3: Transactions vs. Analytics
The evolving Big Data phenomenon has highlighted the partitioning of DBMSs into systems for transactions in the classic sense (aka OLTP, or online transactional processing) and systems for data analysis (aka OLAP, or online analytical processing), which derives measurements and new observations from existing datasets. These two technologies serve two points of view on the same data, each making certain tasks efficient. While write conflicts are frequent in transactions (as when two agents withdraw from the same bank account), analytics can treat the whole dataset as static and run read queries at peak performance. In fact, in many business applications, heavy analytics run overnight on a snapshot of the transactional data, often on a separate server, to deliver results the next day.
Sometimes, as in search engines or high-frequency trading, analytics must run in parallel with current transactions. In this respect, hybrid solutions that perform well in both OLTP and OLAP are now considered cutting-edge. One feasible approach to hybridization is "materialized aggregation," where analytical parameters are accumulated iteratively, step by step, at every operation on the data. This requires additional sophistication on the analytics side; for example, a mean is simple to aggregate iteratively, but a median is not. Still, this is not exactly rocket science: the "running average" approach has served traders for decades, and a median can be accumulated efficiently to within 0.1% deviation. In return, all analytics are available instantly, in place, right in the database. Adding new parameters to analyze is not a problem either: define the new parameter, run a single aggregation procedure overnight to obtain its current value, and update it iteratively from then on. To update hundreds of parameters concurrently in real time, one needs an extremely high-performance transactional database.
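The "running average" case mentioned above can be sketched in a few lines. This is a minimal illustration of materialized aggregation, not any vendor's API: the aggregate is updated in O(1) at every write, so the analytical value is always current without a batch scan.

```python
class RunningStats:
    """Materialized aggregate: updated at transaction time, read instantly."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def record(self, amount):
        """O(1) incremental update per write."""
        self.count += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.mean += (amount - self.mean) / self.count

stats = RunningStats()
for amount in [100.0, 50.0, 150.0]:
    stats.record(amount)

print(stats.count, stats.mean)  # 3 100.0
```

The same pattern extends to variances and approximate quantiles, though, as noted, exact medians resist this treatment and require approximation schemes.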
While hybrid solutions promise both transaction speed and analytical capability, they come with trade-offs of their own, and their drawbacks become apparent in the trade-offs that follow.
Trade-off #4: In-memory vs. Disk-based
OLTP DBMSs receive a great deal of attention because OLTP is a grounding technology for many existing and emerging businesses, such as digital advertising and electronic trading. A pioneer of DBMS R&D, Dr. Michael Stonebraker, has argued that the best way to organize data management is to run a specialized OLTP engine over current data. However, today's partitioning of DBMSs into disk-based and memory-based systems looks quite odd in comparison to the OLTP/OLAP dichotomy.
In 2005, the cost of RAM dropped substantially, making it feasible for business-critical transactional databases to reside entirely in memory. Today, moderately priced server configurations include 128 GB of RAM, and disk-based databases remain very important in the OLAP space. This is not the case with most transactional applications, however, which benefit greatly from in-memory solutions. Memory-optimized databases still use disks to implement persistence and reliability, but in contrast to disk-based DBMSs, they do so in a "gentler," more efficient way.
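One way to picture that "gentler" use of disk is an in-memory store paired with an append-only log. The sketch below is illustrative only (real engines add checkpoints, fsync policies, and crash recovery): reads never touch the disk, and writes cost one sequential append rather than random page updates.

```python
import json
import os
import tempfile

class InMemoryStore:
    """In-memory store with an append-only log for durability (sketch)."""

    def __init__(self, log_path):
        self.data = {}                  # primary copy lives in RAM
        self.log = open(log_path, "a")  # disk holds only the sequential log

    def put(self, key, value):
        # A sequential append is far cheaper than the random I/O of a
        # disk-based engine updating pages in place.
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        self.data[key] = value

    def get(self, key):
        return self.data[key]           # no disk access on the read path

log_path = os.path.join(tempfile.gettempdir(), "store.log")
store = InMemoryStore(log_path)
store.put("account:42", 1000)
print(store.get("account:42"))  # 1000
```

On restart, such an engine would replay the log to rebuild the in-memory state, which is where the persistence guarantee comes from.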
Trade-off #5: Scale-In vs. Scale-out, Consistent vs. Inconsistent
Partitioning data across several computers is a viable way to increase speed. This practice of scaling out rests on the expectation that performance grows linearly with the number of servers. However, scaled-out systems tend to lose transactional consistency, the guarantee that the system resides in a conflict-free state at all times. For example, when money is transferred from one account to another, the operation should either complete with the full amount arriving in the destination account or be discarded entirely.
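The money-transfer example above is the classic all-or-nothing transaction. The sketch below shows the guarantee on a single node using SQLite (table and account names are made up for illustration); a scaled-out system must provide the same behavior across machines, which is exactly what is hard to preserve:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # transfer discarded fully; no partial state persists

transfer(conn, "alice", "bob", 60)   # succeeds
transfer(conn, "alice", "bob", 500)  # fails: balances left unchanged
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [('alice', 40), ('bob', 60)]
```

The failed second transfer leaves no trace: the debit that briefly drove the balance negative is rolled back along with everything else in the transaction.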
Trade-off #6: SQL vs. NoSQL, “Classical” vs. “Modern”
Another route to higher performance is to abandon select DBMS features. From 2007 to 2012, "classical" DBMSs with rich SQL-based syntax ceded ground to NoSQL DBMSs: lightweight, non-consistent, fast in-memory stores, sometimes providing only key-value functionality. This shift offered an alternative to widely used but pricey corporate DBMSs. Over time, however, DBMS users concluded that SQL is a key component for reducing the cost of DBMS implementation and ownership, even outside large enterprises. Users now request even richer syntax, supporting graph operations and basic OLAP right out of the box. This has resulted in the emergence of NewSQL systems that combine performance, consistency, and SQL features. Creating hybrids of SQL and NoSQL is a growing trend, driving still more competition and innovation in the market.
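What "only key-value functionality" means in practice can be shown in a few lines (a toy sketch, not any particular NoSQL product): point reads by exact key are trivially fast, but every query that SQL expresses declaratively becomes a full scan written in application code.

```python
# A minimal key-value "store": fetch by exact key only.
kv = {}
kv["user:1"] = {"name": "John", "city": "Berlin"}
kv["user:2"] = {"name": "Mary", "city": "Paris"}

# Point reads by key are fast and simple...
print(kv["user:1"]["name"])  # John

# ...but an ad-hoc query ("all users in Berlin"), which in SQL is a single
# SELECT ... WHERE clause, becomes a full scan in application code:
berliners = [v["name"] for v in kv.values() if v["city"] == "Berlin"]
print(berliners)  # ['John']
```

This is the trade the NoSQL wave made explicit, and it is the expressiveness that NewSQL systems set out to restore without giving up the performance gains.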