Cassandra Complex Queries — Lessons Learned
Cassandra is a great DB, but not for everything. It works great with sensor data, time series data, and easy reads without too much query combinations. When there is a need to do read before write, you are doing something wrong, so stop and rethink. Read on to learn more.
Join the DZone community and get the full member experience.Join For Free
Just a couple of years ago, the decisions faced by software architects were quite easy, such as which language to choose, in which DB to store data, what kind of frontend technology to use etc. Basically, you could divide your application into a couple of building blocks, choose the best for each building block, and use it throughout the whole application.
Big distributed systems, with huge data flow changed this game, microservices are the new normal, and this multiplied problems with technology choices for each microservice. Now you should be a polyglot in every part of the system and choose the best tool for each use case. But still, the assortment of many technologies in one system comes with a price. Maintenance costs are higher, it is hard to find people which will fit in the team, picking up small details of many technologies is hard, so teams usually stay on basic to advanced level of knowledge in chosen technologies.
Cassandra is scalable and distributed partitioned row storage optimized for fast writes. It is an AP database based on CAP theorem with tunable C (tunable consistency). Basically, you choose what is more important: availability or partition tolerance, and tune the consistency based on your choice. It is a masterless database—a new node in the cluster can pick up information from any node, which means there is no single point of failure, and database can be scaled horizontally.
As for data modeling, it is important to understand some of the implementation internals. Data is stored in rows which have partition keys, so the right choice of this key is important to build up performance. This is also important for data modeling, since in comparison with relational data modeling data is modeled as it will be read, not written. Basically, you should think of queries and model data like that. Data is denormalized, it can exist in multiple columns and each column is optimized for one query. For example, if you need to find books by genre, but also by author, you will most likely end up with two columns, one with genre key and another with author key. So, searching by genre would be one disk seek which will pick up the whole column of books partitioned by requested genre.
Designing, as explained above, makes Cassandra a perfect candidate when working with events. They come sequentially and you can store them in one column—sorted by time and partitioned by data—which will be interesting when doing queries. We worked on an event system and that was the main reason why we chose it. Also, you can partition events into time buckets (by day, week, month) and view each time bucket as a new column, which makes the perfect data store for drawing charts. The data is duplicated, and choosing precision in the chart from one week to one month represents one additional disk seek in another column.
Since we made a choice based on the event use case we decided to place our master data in Cassandra as well. Articles for retail industry were master data which represent a big object with lots of properties (size, color, name, description, barcode, price etc.). At that time, there were no extensive searching requirements (we needed articles only by unique identifier), so we decided to go with Cassandra as data store for master data as well.
Since in Cassandra you model data based on queries, the important thing to note is that you must change your data model as it evolves. When new requirements for querying data appear, the data model must be changed, or you will end up with a wrong or overly complicated data model. We learned that the hard way.
The first requirement appeared in relation to syncing with devices, where we needed information about the time of last update for each article and we needed to sync only the changed articles. Since we stored articles based on unique identifiers and updated information on each batch update, we needed a new table. The idea was to have articles by the times of their updates, so you can pass time and get all articles updated since that time. The articles were updated and we needed only the latest update (not the whole history), also we needed to page through that table without duplicates which caused the first problem, we needed to delete the previous entry from that table and keep only the last one. To do that, we needed the time of last update of the changed article and in order to do that we needed to do read before write (famous anti-pattern in cassandra). Doing this on a large scale was slow, time consuming and error prone. For each update we created many tombstones from delete insert and soon afterwards, in big updates we ran into a tombstones limit reached problem. In other words, our data model started to talk to us. Maintaining this monster logic of reading before writing became a nightmare, debugging, adding small stuff, every small step started to hurt. On the bright side, we at least became aware of the data model problem and we pushed the code as is, since this was a startup phase and change would be time consuming.
The second requirement really made us drop the whole Cassandra article storage, or at least partially. Our web application became overwhelmed with article information and we wanted to give the users a possibility to search, filter and narrow the list. Since articles have lots of attributes, this would provide many possibilities of combining queries (search by color, price, size, narrow search by brand, store etc.). This would be impossible to do in Cassandra, since that would mean duplicating articles in many tables, each by single criteria, and do intersections or unions on application level. This has triggered a new research regarding tackling both problems.
The second requirement made us lean more to indexing engines, particularly Solr and Elasticsearch. We were aware that Apache Spark SQL module could solve complicated queries at application level and it was powerful when placed over Cassandra, but we did not want to make a relational database of Cassandra. Our idea was to use Cassandra with a single table as permanent storage and to index articles for searching and syncing requirement. The core of our system is message queue so storing stuff in Elasticsearch was just a matter of adding one more subscriber to storing queue. Slowly, data started to be indexed in Elasticsearch which gave us searching possibilities in the distributed indexing engine, and we got rest API for free. Also, we could query the articles by their last update, so this solved the first problem as well. Basically, first we find the article, check hash over content of indexed article with hash over content of article which is coming to server, and if it is changed, reindex and update in Cassandra. Devices can sync by querying Elasticsearch API directly and this is bloody fast.
In SmartCat we always choose the right tool for the right job. Cassandra is a great DB, but not for everything. It works great with sensor data, time series data, and easy reads without too much query combinations. When there is a need to do read before write you are doing something wrong, so stop and rethink. With microservices and polyglot being the new normal, developers should not be afraid to use multiple databases on projects and to combine master data in one storage with events over that data stored in Cassandra where they fit perfectly.
Published at DZone with permission of Nenad Bozic, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.