Massive Data Ingest and Concurrent Analytics With MemSQL
As we create ever vaster stores of data and, as a result, require ever faster processing of that data, we will need to be smart about how and where we process it.
The amount of data created in the past two years surpasses all of the data previously produced in human history. Even more shocking is that, of all the data produced, only 0.5% is being analyzed and used. To capitalize on the data that exists today, businesses need the right tools to ingest and analyze it.
The first step in achieving this is ingesting large volumes of data at incredible speed. The distributed nature of the MemSQL environment makes it easy to scale up to petabytes of data. Some customers use MemSQL to process 72 TB of data a day, or more than 6 million transactions per second, while others use it as a replacement for legacy data warehouse environments.
MemSQL offers several key features for optimizing data ingest, as well as supporting concurrent analytics:
MemSQL enables high throughput on concurrent workloads. A distributed query optimizer evenly divides the processing workload to maximize the efficiency of CPU usage. Queries are compiled to machine code and cached to expedite subsequent executions. Rather than cache the results of the query, MemSQL caches a compiled query plan to provide the most efficient execution path. The compiled query plan does not pre-specify values for the parameters, which allows MemSQL to substitute the values upon request, enabling subsequent queries of the same structure to run quickly, even with different parameter values. Moreover, due to the use of multi-version concurrency control (MVCC) and lock-free data structures, data in MemSQL remains highly accessible, even amid a high volume of concurrent reads and writes.
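To make the idea of caching a plan rather than a result concrete, here is a minimal Python sketch. It is not MemSQL's implementation: the `parameterize` helper, the `PlanCache` class, and the string standing in for compiled machine code are all hypothetical names invented for illustration. The sketch only shows that two queries with the same structure but different literal values can share one cached plan.

```python
import re

# Matches quoted string literals or standalone numbers (a deliberately simple rule).
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")

def parameterize(sql):
    """Replace literals with '?' and return (query_shape, parameter_values)."""
    params = []

    def _swap(match):
        params.append(match.group(0))
        return "?"

    shape = _LITERAL.sub(_swap, sql)
    return shape, params

class PlanCache:
    """Hypothetical plan cache keyed by query shape, not by query result."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def get_plan(self, sql):
        shape, params = parameterize(sql)
        if shape not in self.plans:
            # The expensive compile step runs once per query shape.
            self.compilations += 1
            self.plans[shape] = "plan for: " + shape  # stand-in for compiled code
        return self.plans[shape], params

cache = PlanCache()
plan_a, params_a = cache.get_plan("SELECT * FROM orders WHERE id = 42")
plan_b, params_b = cache.get_plan("SELECT * FROM orders WHERE id = 7")
print(cache.compilations, params_a, params_b)
```

Both calls reuse the same plan, and only the parameter values differ between executions.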
Query Execution Architecture
MemSQL has a two-tiered architecture consisting of aggregators and leaves. Aggregators act as load balancers or network proxies, through which SQL clients interact with the cluster. Aggregators store metadata about the machines in the cluster and the partitioning of the data. In contrast, leaves function as storage and compute nodes.
The highly scalable distributed system allows clusters to be scaled out at any time to provide increased storage capacity and processing power. Sharding occurs automatically, and the cluster re-balances data and workload distribution. Data remains highly available and nodes can go down with no effect on performance.
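The division of labor between aggregators and leaves can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MemSQL code: the `Leaf` and `Aggregator` classes, the CRC32 hash, the round-robin partition map, and the default of eight partitions are all illustrative choices.

```python
import zlib

class Leaf:
    """Storage and compute node: holds the rows for the partitions assigned to it."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

class Aggregator:
    """Keeps partition metadata and routes each request to the owning leaf."""

    def __init__(self, leaves, num_partitions=8):
        self.leaves = list(leaves)
        self.num_partitions = num_partitions
        # Metadata: which leaf owns which partition (assigned round-robin here).
        self.partition_map = {
            p: self.leaves[p % len(self.leaves)] for p in range(num_partitions)
        }

    def _leaf_for(self, key):
        partition = zlib.crc32(str(key).encode()) % self.num_partitions
        return self.partition_map[partition]

    def insert(self, key, row):
        self._leaf_for(key).rows[key] = row

    def select(self, key):
        return self._leaf_for(key).rows.get(key)

    def count(self):
        # Fan out to every leaf and combine the partial results.
        return sum(len(leaf.rows) for leaf in self.leaves)

agg = Aggregator([Leaf("leaf-1"), Leaf("leaf-2")])
for i in range(100):
    agg.insert(i, {"value": i})
print(agg.count(), agg.select(5))
```

The client only ever talks to the aggregator. A point lookup touches a single leaf, while an aggregate such as `count` is computed on each leaf and merged.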
In addition to being fast, consistent, and scalable, MemSQL persistently stores data. Transactions are committed to disk as logs and periodically compressed as snapshots of the entire database. If any node goes down, it can restart using one of these logs.
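The log-plus-snapshot recovery pattern described above can be illustrated with a small Python sketch. The `DurableStore` class is hypothetical and keeps its "disk" state in ordinary Python objects. It shows the general technique only (load the latest snapshot, then replay the log written since), not MemSQL's on-disk format.

```python
class DurableStore:
    """Toy key-value store with a transaction log and periodic snapshots."""

    def __init__(self):
        self.data = {}      # in-memory state
        self.snapshot = {}  # stand-in for a snapshot file on disk
        self.log = []       # stand-in for a transaction log on disk

    def put(self, key, value):
        self.log.append((key, value))  # record the write before applying it
        self.data[key] = value

    def take_snapshot(self):
        self.snapshot = dict(self.data)
        self.log.clear()  # these entries are now captured by the snapshot

    def recover(self):
        """Rebuild in-memory state: load the snapshot, then replay the log."""
        state = dict(self.snapshot)
        for key, value in self.log:
            state[key] = value
        self.data = state
        return state

store = DurableStore()
store.put("a", 1)
store.take_snapshot()
store.put("b", 2)
store.put("a", 3)
store.data = {}  # simulate losing in-memory state in a crash
print(store.recover())
```

Snapshots keep the log short, so a restart replays only the writes made since the last snapshot.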
In-Memory and On-Disk Storage
MemSQL supports storing and processing data with an in-memory rowstore, or a memory or disk-based columnstore. The rowstore works best for optimum performance in transactional workloads because of the sheer speed of in-memory processing. The columnstore operates best for cost-effective data storage of large amounts of historical data for analysis. A combination of the rowstore and columnstore engines allows users to analyze real-time and historical data together in a single query.
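As a rough illustration of why the two layouts suit different workloads, the Python sketch below stores the same three records both ways. The sample data and the names `row_index`, `columns`, `lookup`, and `total_amount` are invented for this example and say nothing about MemSQL's actual storage engines.

```python
rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 20.0},
    {"id": 3, "region": "east", "amount": 5.5},
]

# Row layout: each record is stored together, so reading or updating
# one whole record by key is a single lookup (transactional access).
row_index = {r["id"]: r for r in rows}

# Column layout: each attribute is stored contiguously, so an aggregate
# scans only the column it needs (analytical access).
columns = {name: [r[name] for r in rows] for name in rows[0]}

def lookup(order_id):
    return row_index[order_id]

def total_amount():
    return sum(columns["amount"])

print(lookup(2), total_amount())
```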
In 2015, MemSQL introduced Streamliner, an integrated Apache Spark solution. Streamliner allows users to build real-time data pipelines. It extracts and transforms the data through Apache Spark, and loads it into MemSQL to persist the data and serve it up to a real-time dashboard or application.
Streamliner comes with a versatile set of tools ranging from development and testing applications to personalization for managing multiple pipelines in production. You can use Streamliner through the Spark tab in the MemSQL Ops web interface and through the MemSQL Ops CLI.
In addition to saving time by automating much of the work associated with building and maintaining data pipelines, Streamliner offers several technical advantages over a home-rolled solution built on Spark:
Streamliner provides a single unified interface for managing many pipelines, and allows you to start and stop individual pipelines without affecting other pipelines running concurrently.
Streamliner offers built-in developer tools that dramatically simplify developing, testing, and debugging data pipelines. For instance, Streamliner allows the user to trace individual batches all the way through a pipeline and observe the input and output of every stage.
Streamliner handles the challenging aspects of distributed real-time data processing, allowing developers to focus on data processing logic rather than low-level technical considerations. Under the hood, Streamliner leverages MemSQL and Apache Spark to provide fault tolerance and transactional semantics without sacrificing performance.
The modularity of Streamliner, which separates pipelines into extract, transform, and load phases, facilitates code reuse. With thoughtful design, you can mix, match, and reuse extractors and transformers.
Out of the box, Streamliner comes with built-in extractors, such as the Kafka extractor, and transformers, such as a CSV parser and JSON emitter. Even if you find you need to develop custom components, the built-in pipelines make it easy to start testing without writing much or any code up front.
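The extract, transform, and load separation described above can be sketched in plain Python. This is not the Streamliner API. Every name here (`list_extractor`, `csv_transformer`, `json_transformer`, `ListLoader`, `run_pipeline`) is hypothetical, the extractor reads from an in-memory list where a Kafka extractor would poll a topic, and the loader appends to a list where a real pipeline would write to MemSQL. The point is only that independent stages can be mixed and reused.

```python
import csv
import io
import json

def list_extractor(batches):
    """Extract phase: yield one batch of raw lines at a time."""
    for batch in batches:
        yield batch

def csv_transformer(column_names):
    """Transform phase: build a parser that turns CSV lines into dicts."""
    def transform(batch):
        reader = csv.reader(io.StringIO("\n".join(batch)))
        return [dict(zip(column_names, fields)) for fields in reader]
    return transform

def json_transformer(records):
    """Transform phase: emit each record as a JSON string."""
    return [json.dumps(record, sort_keys=True) for record in records]

class ListLoader:
    """Load phase: collect output rows (stand-in for a database write)."""

    def __init__(self):
        self.rows = []

    def load(self, records):
        self.rows.extend(records)

def run_pipeline(extractor, transformers, loader):
    for batch in extractor:
        for transform in transformers:
            batch = transform(batch)
        loader.load(batch)

loader = ListLoader()
run_pipeline(
    list_extractor([["1,click", "2,view"], ["3,click"]]),
    [csv_transformer(["id", "event"]), json_transformer],
    loader,
)
print(loader.rows)
```

Because each stage only agrees on the shape of a batch, the same CSV transformer could sit behind a different extractor, or feed a different loader, without changes.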
Get Started With MemSQL Community Edition
With these extensive capabilities for massive data ingest and analytics, MemSQL provides a robust solution for large influxes of data from IoT sources, business transactions, applications, and a variety of new sources cropping up today. Try it for yourself today.
Published at DZone with permission of Dale Deloy, DZone MVB. See the original article here.