Getting Started With Apache Ignite: Part I
In short, Apache Ignite is a distributed in-memory cache, query, and compute engine built to work with large-scale data sets in real-time.
This is Part I of a multi-part series. In this post, Apache Ignite is introduced. In the next post, we will work with code samples.
A "See You Around" to Cassandra
I used to do technical training and consulting on a well-known project called Apache Cassandra. I joined DataStax back in 2014 when Cassandra was so, so shiny. People were noticing things like the magic of Netflix and all of a sudden, everyone wanted highly available, peer-to-peer, horizontally scalable databases built for Big Data. People wanted to learn Cassandra because it solved a lot of problems that relational databases just can't. Cassandra distributes and scales really well. Writes are ridiculously cheap. That's just how Cassandra works. (If you want to know a little bit more about Cassandra, check out my last blog post.)
There were a lot of hurdles in bridging the gap between relational systems and Cassandra (and I could write a whole post just on that). For instance, relational and legacy systems come with a tech workforce educated in that mindset, while Cassandra requires a paradigm shift in usage and skill. Additionally, not every relational use case is a use case for Cassandra, even though it's shiny. Also, replacing a database is no small endeavor, especially when it requires a steep learning curve (and ignoring that curve and treating Cassandra as if it were relational is disastrous).
In my time there, I learned a lot about a) how to teach people things they don't want to hear or listen to and b) how to help them understand the reasoning and big picture behind those things. So, here is a picture of me teaching someone things with a weird claw hand to prove my point.
Noticing some market forecasts and some really exciting and cool features in the Ignite project that couldn't be solved by Cassandra, I decided to leave DataStax and join GridGain (the enterprise platform around Apache Ignite) as a Solution Architect. So, enter Apache Ignite. That is what this series is about; assessing if Ignite is a good fit, learning it in a step-wise fashion, and understanding the big picture behind it.
Getting Started With Apache Ignite
I know, it's getting out of control. You need a Mayan calendar, a compass, the blood of an ox, and a magical incantation to keep up with all these Apache projects. How do you separate the signal from the noise? What's a good project that's worth your time and tinkering? Oftentimes, when you follow the buzz, it leads to disappointment and leaves you drowning in the salty well of your own tears.
The best way to save yourself from that salty well is to break things down into simpler, digestible chunks. First, decide whether the project makes sense for your use case. Second, try various aspects of the tech in a multi-part series of guides. Why should things be done this way?
To the first point: I've been traveling the world consulting and training people on distributed systems for a few years now and found one common yet unfortunate pattern. I like to call it the square-peg-into-round-hole phenomenon. This is where someone picks a technology out of curiosity, resume-building, general bias, or otherwise.
You might get lucky and pick the right thing, but in general, it's bad practice to roll the dice and pick a technology in this fashion. In fact, the best way to think about things is through the perspective of one's use case. Let's first think about the types of use cases we may have, whether the technology we are evaluating can handle or excel with them, and figure out if that's a fit that's worth our time. No salty wells filled with our own tears, remember?
To the second point: never boil the ocean. I find it's just a better experience all around when learning a new technology to learn it in a step-wise manner. It gives you time to actually learn. When you practice something, you strengthen the neural connections in your brain. This lets you recall things faster, be more productive in that subject matter, and do less guesswork in the long run. TL;DR — practice makes you smarter and better at stuff because brains.
To that, you might say, "Well, we went into prod yesterday, Dani!" I know, it happens. The problem is that without the proper foundational knowledge and skills, your house will eventually fall (and by house, I mean cluster). So, that means late nights with services, and no one likes that — especially your wife/husband and/or dog.
Perhaps you are the type of person who thrives in chaos and you want to see the most advanced topics — you know, the deep end of the pool type. Sign up for the Apache Ignite dev and user lists and get your hands dirty (maybe even with some contribution).
Defining Apache Ignite and Its Use Cases
To understand what Apache Ignite's use cases are, we need to understand first what it does. So, what is Apache Ignite?
My best definition of Apache Ignite is that it's a distributed in-memory cache, query, and compute engine built to work with large-scale data sets in real-time. A cluster of Ignite nodes (simply a combination of server and client nodes) slides between the application and data layers. From the application side, Ignite reads and writes objects, handling serialization and deserialization itself. There are APIs for applications in Java, .NET, and C++ (with many more slated for the project). Ignite partitions data across the cluster: each node owns a portion of the overall data in a shared-nothing architecture.
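As a minimal sketch of what that looks like from the application side (the cache name here is arbitrary, and this starts a single embedded node with the default configuration):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteHello {
    public static void main(String[] args) {
        // Start an Ignite node embedded in this JVM.
        try (Ignite ignite = Ignition.start()) {
            // Get or create a distributed cache; Ignite handles
            // serialization/deserialization of keys and values.
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("myCache");

            cache.put(1, "Hello");
            cache.put(2, "Ignite");

            System.out.println(cache.get(1) + ", " + cache.get(2) + "!");
        }
    }
}
```

In a real cluster, the same `put`/`get` calls transparently route to whichever node owns the key's partition.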
Ignite implements high availability within a single datacenter. There are primary and backup nodes for the data. If a node goes down, the backup gets promoted to primary and a new backup is elected. The data is then rebalanced around the cluster, seamlessly, without operator involvement. If you need high availability across multiple data centers, GridGain adds that feature in its enterprise edition built around the Ignite project. I will discuss this in a later post.
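The number of backup copies is set per cache. A sketch (cache name and value are arbitrary) of configuring one backup per partition, so a node failure promotes the backup as described above:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class BackupConfigExample {
    public static void main(String[] args) {
        CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("accounts");
        cfg.setCacheMode(CacheMode.PARTITIONED); // data partitioned across nodes
        cfg.setBackups(1); // each partition keeps one backup copy on another node

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);
            cache.put(1, "replicated once");
            System.out.println(cache.get(1));
        }
    }
}
```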
From the perspective of the data layer, you can hook in any database accessible over a JDBC or ODBC connection. You can also connect any file system, such as HDFS. While caching is lightning fast, it is not a system of record. Using Ignite, you are able to do read-through and write-through operations to your database.
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton.
Architecting Ignite to read and write through to the database is one of its best features because it reduces the complexities of managing a separate cache layer. Instead of dealing with the complexities of a separate cache (i.e., choosing between two-phase commits in a distributed environment or serving stale data), we simply write through or read through to our persistence store. Aside from this benefit, there's also the ability to hook in your own database or filesystem with a few lines of code and start using the optimizations of Ignite right away.
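As a configuration sketch of what "a few lines of code" can look like (the `Person` type and the `"myDataSource"` bean name are assumptions for illustration; the store factory is Ignite's built-in JDBC POJO store):

```java
import org.apache.ignite.cache.store.jdbc.CacheJdbcPojoStoreFactory;
import org.apache.ignite.configuration.CacheConfiguration;

public class WriteThroughConfig {
    public static CacheConfiguration<Long, Person> personCacheConfig() {
        CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("persons");

        // Delegate cache misses and updates to the underlying database.
        cfg.setReadThrough(true);   // cache miss -> load the entry from the DB
        cfg.setWriteThrough(true);  // cache put  -> write the entry to the DB

        CacheJdbcPojoStoreFactory<Long, Person> storeFactory =
            new CacheJdbcPojoStoreFactory<>();
        storeFactory.setDataSourceBean("myDataSource"); // assumed datasource bean
        cfg.setCacheStoreFactory(storeFactory);

        return cfg;
    }
}
```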
Ignite supports SQL queries, distributed joins and indexing, and DML commands, and is fully ANSI SQL-99 compliant. It is also fully ACID compliant for distributed transactions. Given what Ignite supports, you may already see how it is a great fit for relational use cases that need faster transactions. However, there are far more use cases that can benefit from the Ignite framework. Here's just a short list:
- Speeding up a relational store.
- NoSQL stores with OLTP workloads (and real-time analytical processing as well).
- Operational analytics and compute.
- Fraud detection.
- High-frequency trading systems.
- Mission-critical web applications (i.e., online banking).
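To make the SQL support concrete, here is a small sketch of running an ANSI SQL query over cached data (the cache name and values are arbitrary; `setIndexedTypes` exposes the key/value pair to the SQL engine as the built-in `_key`/`_val` columns):

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class SqlQueryExample {
    public static void main(String[] args) {
        CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("cities");
        cfg.setIndexedTypes(Integer.class, String.class); // make entries queryable via SQL

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);
            cache.put(1, "Lisbon");
            cache.put(2, "Kyiv");

            // Query cache entries with plain SQL; joins and DML work similarly.
            SqlFieldsQuery qry =
                new SqlFieldsQuery("SELECT _key, _val FROM String ORDER BY _key");
            List<List<?>> rows = cache.query(qry).getAll();
            rows.forEach(System.out::println);
        }
    }
}
```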
At a high level, when should you consider Ignite? If you want to speed up your transactions and are already invested in a relational database, Ignite can help you achieve horizontal scale-out on reads.
So, why hasn't memory already taken over the server room?
Some Background on the In-Memory Scene
People from financial technology (fintech) backgrounds are generally familiar with the terms "data grids" or "in-memory fabrics" for enhancing the performance of their clusters or software architectures. This is due to the historically higher cost of memory over disk. Generally, the financial industry has invested in these technologies because of the restrictive time demands of certain use cases like high-frequency trading or online banking. They tend to have tight SLAs and have judged the upfront hardware investment worthwhile for their long-term return on investment (ROI). If you're like me and you came from a distributed database background (or just a database background), you've likely never heard of these terms before. However, you've likely heard of memory, or RAM, in a regular machine like the one you're reading this blog post from.
So, let's break things down in terms of a regular computer. Memory, in general, is roughly 1000x faster than disk, depending on which memory/disk types you are comparing and for which operations. Simply by using software that reads and writes data to memory rather than disk, we should speed things up by roughly 1000x, right? Well, to that, you say, "Party time!"
Not quite. Obviously, it isn't going to be a perfect 1:1 relationship once we take into account things like networking, round-trip calls, potentially two-phase commits, and dealing with a distributed environment (i.e., ensuring our transactions are as consistent as we want — either ACID or eventually consistent). However, we are beholden to these same things in a disk-bound architecture too, only without the speed optimizations of memory. Memory speeds things up quite a bit. Not a perfect 1000x, due to other factors and how you configure your in-memory system (like Ignite), but it will be a lot faster than having I/O as our bottleneck.
So, why aren't databases simply used as persistence stores or snapshot points, with our architectures optimized to use memory primarily? Why aren't the terms "data grids" and "in-memory fabrics" more common knowledge in tech? It really comes back to funds. Optimizing towards memory was financially out of reach. However, just as disk prices plummeted, memory prices are now following suit. This is likely due to the rise in popularity of commodity hardware, the cloud, and systems of horizontal scale.
We have only scratched the surface of where Ignite falls conceptually. There's a lot more ground to cover and importantly, a lot of code to try out! In my next post, we will look at some code samples and I will show you how to start up Ignite.
Published at DZone with permission of Dani Traphagen . See the original article here.