At Spark Summit in San Francisco, we highlighted our PowerStream showcase application, which processes and analyzes data from more than 2 million sensors on 200,000 wind turbines installed around the world. We sat down with one of our PowerStream engineers, John Bowler, to discuss his work on our integrated MemSQL and Apache Spark solutions.
What is the Relationship Between MemSQL and Spark?
At its core, MemSQL is a database engine, and Spark is a powerful option for writing code to transform data. Spark is a way of running arbitrary computation on data either before or after it lands in MemSQL.
The first component to MemSQL and Spark integration is the MemSQL Spark Connector, an open-source library. Using the connector, we are able to use Spark as the language for writing distributed computations, and MemSQL as a distributed processing and storage engine.
For those familiar with Spark, here is how the MemSQL Spark Connector allows tight integration between MemSQL and Spark:
MemSQLContext.sql("SELECT * FROM t"), you can create a DataFrame in Spark that is backed by a MemSQL table. When you string together a bunch of SparkSQL operations and call collect() on the result, these DataFrame operations will actually run in the MemSQL database engine as direct SQL queries. This can give a major performance boost due to the SQL-optimized nature of MemSQL.
df.saveToMemSQL(), you can take a DataFrame and persist it to MemSQL easily
The second component to MemSQL and Spark integration is Streamliner, which is built on top of the Spark Connector. Streamliner enables you to use Spark as a high-level language to create Extract, Transform, Load (ETL) pipelines that run on new data in real time.
We built Streamliner around a ubiquitous need to ingest data as fast as possible and query the information instantly. With Streamliner, you can write the logic of your real-time data analytics pipeline such as parsing documents, scoring a machine-learning model, or whatever else your business requires, and instantly apply it to your MemSQL cluster. As soon as you have raw analytics data available for consumption, you can process it, see the results in a SQL table, and act on it.
What Type of Customer Would Benefit From the MemSQL Streamliner Product?
A customer who is already using Kafka to collect real-time information streaming from different sources can use Streamliner out-of-the-box. Without writing any code, you can take all the data in a Kafka topic and append it to a MemSQL table instantly. MemSQL will automatically place this in a JSON format by default so no additional work is required. However, if you want to take semi-structured or unstructured "messages" and turn them into "rows" for MemSQL, you can write arbitrary code in the Streamliner "Transform" step. Streamliner also allows you to do this inside the web browser console.
Consider this example — suppose you want to make a dashboard that will monitor data from your entire company and produce real-time visualizations or alerts. Your production application is inserting into a production database, emitting events, or outputting logs. You can optimize this dashboard application by taking all of this data and routing it to a distributed message queue such as Kafka, or writing it directly to a MemSQL table. You can then write your data-transformation or anomaly-detection code in Spark. The output of this is data readily available in MemSQL for any SQL-compatible Business Intelligence tool, your own front-end application, or users in your company running ad-hoc queries.
What is PowerStream?
PowerStream is a showcase application that we built on top of Streamliner. It's an end-to-end pipeline for high-throughput analytics and machine learning.
We have a simulation of 20,000 wind farms (200,000 individual turbines) around the world in various states of disrepair. We use this simulation to generate sample sensor data, at a rate of 1-2 million data points per second. Using a co-located Kafka-Spark-MemSQL cluster, we take these raw sensor values and run them through a set of regression models to determine 1) how close each turbine is to failing, and 2) which part is wearing down.
What is the Most Interesting Part of PowerStream?
I am personally interested in the data science use case. PowerStream demonstrates how we can deploy a machine learning model to a cluster of nodes and "run" the model on incoming data, writing the result to MemSQL in real time.
Data science is a big field, and running machine learning models in production is an important part, but of course not the whole picture. Data exploration, data cleaning, feature extraction, model validation — both interactively (offline) and in production (online) — are all parts of a complete data science workflow.