My first opportunity to dive into the big data landscape was a real eye-opener. So many products claim to have the solution to big data problems, but the reality is quite different: many competing products have overlapping features, each suitable for solving a particular set of problems.
Add to that the confusion of buzzwords. In an industry full of jargon, buzzwords thrown into a discussion can quickly derail the conversation, focusing it on differences between products rather than on the task at hand.
By defining a few broad terms, we can establish a common vocabulary for more productive dialogue about big data projects.
But before we start, let’s reiterate a few terms that I’m sure all of us use regularly:
Server: A dedicated, powerful machine that can be extended, up to a limit, by adding more disk for storage and more CPU for horsepower.
Cluster: A collection of fairly inexpensive, loosely connected machines (or nodes, I will use these terms interchangeably) that act like a big server. New machines can be added for additional disk space and processing power.
Vertical Scaling: The process of making a single machine stronger by adding more disk for storage and more CPU for horsepower.
Horizontal Scaling: The process of making a cluster stronger by adding more nodes.
Distributed Computing: A large topic, but within our context, this term means out-of-the-box support for functionality like indexes, auto-scaling, data partitioning and replication, failover and self-healing, and much more.
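To make horizontal scaling concrete, here is a minimal, hypothetical sketch in plain Python of how a cluster might decide which node owns a given key. It uses naive hash-modulo placement; real products typically use consistent hashing instead, precisely so that adding a node does not reshuffle most keys. All names here are made up for illustration.

```python
import hashlib

def node_for_key(key, nodes):
    """Map a key to the node that owns it (naive hash-modulo placement)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
owner = node_for_key("user:42", nodes)  # deterministic: same key, same node
```

Adding a fourth node to the list changes `len(nodes)`, and with it the placement of most existing keys; that is the weakness consistent hashing was designed to fix.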
With that shared understanding established, we can define the main terms:
Classic SQL: Standard relational database management systems (RDBMS) fall into this category, e.g. Oracle, SQL Server, MySQL, and Db2. These databases are designed to store data within a single machine, using collections of tables and their relationships. They store and retrieve data in a highly optimized way, but they were designed to scale vertically. As we know, there is a limit to vertical scaling, and with the rise of social media and other data-intensive applications, it started becoming a bottleneck.
NoSQL: It may raise a few eyebrows, but I personally consider NoSQL to be a brilliant hack. Please don’t get me wrong, I love the technology; it’s perfect for solving many business problems. But if I look at the history, engineers started experimenting with different ways to overcome the constraints of Classic SQL and ended up with one distributed hash map (with all the distributed computing bells and whistles). It offered great freedom from schema constraints, ACID guarantees were limited to a single key-value entry, and the simplicity of storage boosted performance manyfold. But remember, it’s not all magic: we still have to choose between high availability and strong consistency, developers must take extra care over data consistency, and data needs to be modeled around the fetch queries (data partition keys), and so on. I hope you get the picture. Products that fall into this category include Cassandra, HBase, DynamoDB, Redis, LevelDB, etc.
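The "one distributed hash map" idea, and the discipline of modeling data around fetch queries, can be sketched with a toy, single-process stand-in. The key format below is invented for illustration; real NoSQL products distribute and replicate these entries across a cluster, but the programming model is essentially this:

```python
# A toy stand-in for a NoSQL store: a hash map keyed by a partition key.
store = {}

def put(partition_key, value):
    store[partition_key] = value

def get(partition_key):
    return store.get(partition_key)

# Data is modeled around the fetch query: we know we will ask for
# "orders for a given user on a given day", so the key embeds both.
put("orders#user42#2024-01-15", {"items": 3, "total": 59.97})
order = get("orders#user42#2024-01-15")
```

Notice there is no schema and no JOIN; if you later need "all orders for a given day across users", you must either scan everything or store the data a second time under a different key, which is exactly the kind of extra care mentioned above.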
NewSQL: As mentioned above, NoSQL databases are great for handling high data volume and processing, but they lack relational-level data consistency and transactional support. NewSQL is a simple idea: bring the good parts of relational and NoSQL databases together. It aims to provide the same SQL and transactional support as an RDBMS along with horizontal scaling. That is a big challenge, especially in supporting table joins over distributed environments (imagine two tables living in two data centers) and stored procedures (developer code), where the probability of things going wrong is far higher than the potential advantages. In short, NewSQL databases are RDBMS sitting on top of distributed infrastructure; examples include MemSQL and VoltDB.
In-Memory Database (IMDB): Any database product that keeps all of its data in memory for high throughput can be categorized as an IMDB. It’s not necessary for all the data to live only in RAM: in most cases, these products are backed by (or directly support) disk-based storage, with read-through and write-behind behavior. Essentially, most NoSQL and NewSQL databases can be considered IMDBs.
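Read-through and write-behind are easiest to see in code. Here is a minimal sketch, assuming a plain dict stands in for the disk-based store; real IMDBs drain the write queue on a background thread or timer rather than on an explicit `flush()` call:

```python
from collections import deque

class WriteBehindCache:
    """Toy sketch: serve reads and writes from memory, load misses from
    disk-backed storage (read-through), persist writes later (write-behind)."""

    def __init__(self, backing):
        self.backing = backing              # stand-in for disk-based storage
        self.memory = {}
        self.pending = deque()              # writes accepted but not yet persisted

    def get(self, key):
        if key not in self.memory:          # miss: read through to "disk"
            self.memory[key] = self.backing[key]
        return self.memory[key]

    def put(self, key, value):
        self.memory[key] = value            # in-memory write is immediate
        self.pending.append((key, value))   # persistence is deferred

    def flush(self):
        while self.pending:                 # later: drain the write queue
            key, value = self.pending.popleft()
            self.backing[key] = value
```

The trade-off is visible in `pending`: between a `put` and the next `flush`, a crash loses those writes, which is the price paid for not waiting on disk per write.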
In-Memory Data Grid (IMDG): Compared to IMDBs, IMDGs are more general-purpose, offering multiple data structures on top of a distributed platform and allowing interaction with the data through various interfaces: key-value, object, JCache, etc. A SQL-based interface can be just one of those. Seen this way, an IMDB offers a small subset of an IMDG’s functionality (provided the IMDG has the right SQL support).
Also, IMDGs expose their distributed platform via simple APIs such as the Java concurrency frameworks and MapReduce-style APIs (which may have additional abstractions to simplify usage), and they allow heterogeneous client technologies to interact with the grid using binary protocols.
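The MapReduce-style API mentioned above can be sketched in a few lines. This is a hypothetical, single-process illustration where plain lists stand in for the partitions a grid would hold on different nodes; the point is the shape of the computation, not any particular product’s API:

```python
from functools import reduce

# Hypothetical data already partitioned across three "nodes" (plain lists here).
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def grid_map_reduce(partitions, mapper, reducer, initial):
    # Each node maps and reduces its own partition locally...
    partials = [reduce(reducer, map(mapper, part), initial) for part in partitions]
    # ...then the grid combines the partial results into one answer.
    return reduce(reducer, partials, initial)

total_of_squares = grid_map_reduce(partitions, lambda x: x * x, lambda a, b: a + b, 0)
# total_of_squares == 285 (sum of squares of 1..9)
```

The appeal for a grid is that the per-partition work runs where the data lives, so only the small partial results cross the network.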
Which to Use, and When
There is no simple answer, but here are a few guidelines.
If you have a business scenario where the data (now and with future growth) falls well within the parameters of a Classic SQL database, then by all means, go for it. Those products are well proven, optimized, and time-tested.
If data requirements are high in terms of size and processing, and data consistency is not the biggest concern, then NoSQL may be a better fit.
If data requirements are high in terms of size and processing, and data consistency is also a big concern, then have a look at NewSQL products. But remember, scalability will be limited compared to NoSQL products, which is to be expected, since NoSQL doesn’t carry the relational overhead (transactional bookkeeping and JOINs). So if the data volume is really high, you may be better off using NoSQL and writing your own code to keep data consistency in check.
On a different note, I think NewSQL databases are an ideal replacement for existing Classic SQL databases; if not now (because of production stability), then definitely in a few years.
And if you find yourself in one of these common scenarios:
- Your Classic SQL database has become the bottleneck, but you are not in a position to replace it.
- You are being charged per transaction, and the cost is growing with usage.
- Data needs to be collected from various sources (databases, files, etc.) and processed further.
- You need to tune performance with a caching layer.
- You are running time-consuming parallel computations.
Go for a data grid.
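The caching-layer and per-transaction-cost scenarios share one pattern: put the grid in front of the expensive backend and only pay for the first lookup. Here is a minimal cache-aside sketch; the `loader` function is a made-up stand-in for a Classic SQL query or a billed API call:

```python
class CacheAside:
    """Toy cache-aside layer in front of an expensive (or per-transaction
    billed) backend: the loader runs once per key, repeats come from memory."""

    def __init__(self, loader):
        self.loader = loader
        self.cache = {}
        self.backend_calls = 0   # track how often we actually hit the backend

    def get(self, key):
        if key not in self.cache:
            self.backend_calls += 1
            self.cache[key] = self.loader(key)
        return self.cache[key]

cache = CacheAside(lambda k: "row-for-" + k)
for _ in range(3):
    result = cache.get("42")
# Three reads, but only one trip to the backend.
```

A real data grid adds what this toy lacks: the cached entries are partitioned and replicated across nodes, so every application instance shares one coherent cache instead of each warming its own.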
I hope this helps clear up some of the confusion within your teams.