MemSQL 6: Product Pillars and Machine Learning Approach
MemSQL 6 was recently released and it includes a range of new machine learning capabilities, closing the gap between data science and operational applications.
Today marks another milestone for MemSQL as we share the details of our latest release, MemSQL 6. This release encapsulates over one year of extensive development to continue making MemSQL the best database platform for real-time analytics with a focus on real-time data warehouse use cases.
Additionally, MemSQL 6 brings a range of new machine learning capabilities to MemSQL, closing the gap between data science and operational applications.
MemSQL 6 has three foundational pillars:
- Extensibility
- Query performance
- Enhanced online operations
Let's explore each of these in detail.
Extensibility
Extensibility covers the world of stored procedures, user-defined functions (UDFs), and user-defined aggregates (UDAs). Together, these capabilities represent a mechanism for MemSQL to offer in-database functions that provide powerful custom processing.
If you are familiar with other databases, you may know of PL/SQL (Procedural Language/Structured Query Language), developed by Oracle, or T-SQL (Transact-SQL), jointly developed by Sybase and Microsoft. MemSQL has developed its own approach to offering similar functions with MPSQL (Massively Parallel Structured Query Language).
MPSQL takes advantage of the new code generation that was implemented in MemSQL 5. Essentially, we are able to use that code generation to compile MPSQL functions. Specifically, we generate native machine code for stored procedures, UDFs, and UDAs, and inline it into the compiled code that we generate for a query.
Long story short, we expect MPSQL to provide a level of peak performance not previously seen with other databases' custom functions.
MemSQL extensibility functions are also aware of our distributed system architecture. This innovation allows for custom functions to be executed in parallel across a distributed system, further enhancing overall performance.
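To illustrate why an aggregate that understands the distributed architecture can run in parallel, here is a minimal Python sketch. It is not MPSQL and not MemSQL's actual API. It assumes a user-defined aggregate is split into four hypothetical steps (`init`, `iterate`, `merge`, and `terminate`), so that each partition can build its own partial state before the partial states are combined.

```python
from functools import reduce

# Hypothetical user-defined aggregate: an average kept as (sum, count) state.
# The four function names are assumptions made for this illustration.

def init():
    """Starting state for an empty group."""
    return (0.0, 0)

def iterate(state, value):
    """Fold one row's value into the running state."""
    total, count = state
    return (total + value, count + 1)

def merge(left, right):
    """Combine two partial states built on different partitions."""
    return (left[0] + right[0], left[1] + right[1])

def terminate(state):
    """Turn the final state into the aggregate's result."""
    total, count = state
    return total / count if count else None

def aggregate_partition(rows):
    """Run the aggregate over one partition's rows, producing partial state."""
    return reduce(iterate, rows, init())

def distributed_average(partitions):
    """Combine per-partition partial states, then compute the final value."""
    partials = [aggregate_partition(rows) for rows in partitions]
    return terminate(reduce(merge, partials, init()))

# Illustrative data: three partitions of the same logical table.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = distributed_average(partitions)
```

The sketch walks the partitions one after another. In a distributed database, each call to `aggregate_partition` could run on a different node, and only the small partial states would need to travel across the network.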
Benefits of extensibility include the ability to centralize processes in the database across multiple applications, the performance of embedded functions, and the potential to create new machine learning functions as detailed later in this post.
Query Processing Performance
MemSQL 6 includes breakthrough improvements in query processing. One area is through operations on encoded data. MemSQL 6 includes dictionary encoding, which can translate data into highly compressed unique values that can then be used to conduct incredibly fast scans.
Consider the example of a public dataset about every airline flight in the United States from 1987 until 2015, as outlined in our blog post on delivering scalable self-service analytics.
With this dataset MemSQL can encode and compress the data, allowing for extremely rapid scans of up to one billion rows per second per core.
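As a rough picture of how dictionary encoding lets a scan operate on encoded data, consider the following Python sketch. It is a simplified illustration, not MemSQL's storage format. The function names and the sample carrier codes are made up for the example.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value gets a small integer code."""
    dictionary = []   # code -> original value
    code_of = {}      # original value -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def count_equal(dictionary, codes, target):
    """Count matching rows by comparing integer codes instead of strings."""
    if target not in dictionary:
        return 0
    target_code = dictionary.index(target)  # one lookup, then integer compares
    return sum(1 for code in codes if code == target_code)

# Illustrative column of airline carrier codes (sample data, not the real dataset).
carriers = ["AA", "DL", "AA", "UA", "DL", "AA"]
dictionary, codes = dictionary_encode(carriers)
aa_flights = count_equal(dictionary, codes, "AA")
```

The filter value is translated to its code once, and the scan then compares small integers. That is the property that makes encoded columns compact and quick to scan.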
MemSQL 6 also takes advantage of Intel's advances in Single Instruction, Multiple Data (SIMD) processing. This technique allows the CPU to complete multiple data operations in a single instruction, essentially vectorizing the query so that many values are processed in parallel.
The benefits of these query processing advancements include having a detailed data view without needing to pre-process the data. This further allows for interactive analysis on raw, unaggregated data, providing the most up-to-date and accurate query results possible.
Enhanced Online Operations
To power mission-critical applications, data platforms must be online all the time, and with MemSQL 6 we have expanded the set of operations that MemSQL can perform while staying online. This includes broader online coverage for DDL operations and the ability for any node to perform DDL operations.
The benefits of these improvements include more sophisticated monitoring and recovery, easier application development, and improved overall availability.
Machine Learning and MemSQL 6
MemSQL 6 helps close the gap between machine learning and operational applications in three areas:
- Built-in machine learning functions
- Real-time machine learning scoring
- Machine learning in SQL with extensibility
Built-In Machine Learning Functions
MemSQL 6 includes new machine learning functions like DOT_PRODUCT, which can be used for real-time image recognition but also for any application requiring the comparison of two vectors. While this function itself is not new in the world of machine learning, MemSQL now delivers this function within its distributed SQL database, enabling an unprecedented level of performance and scale.
For more information, check out this blog post, An Engineering View on Real-Time Machine Learning.
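To show what comparing two vectors with a dot product looks like, here is a small Python sketch. It mirrors the idea behind DOT_PRODUCT but is not MemSQL's implementation. The `best_match` helper and the three-element feature vectors are hypothetical; real image feature vectors are much longer.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are comparable."""
    norm = math.sqrt(dot_product(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def best_match(query, candidates):
    """Return the name of the candidate most similar to the query vector."""
    q = normalize(query)
    return max(
        candidates,
        key=lambda name: dot_product(q, normalize(candidates[name])),
    )

# Illustrative feature vectors (made-up values for the example).
candidates = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
match = best_match([0.8, 0.2, 0.1], candidates)
```

With unit-length vectors, a larger dot product means a smaller angle between the vectors, so the candidate with the highest value is the closest match.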
Real-Time Machine Learning Scoring
MemSQL includes the ability to manage real-time data pipelines with custom transformations at ingest. Such a transformation can also execute a machine learning model and score each record as it arrives. For example, you may choose to take a machine learning model from SAS and export it using PMML, the Predictive Model Markup Language.
This allows real-time scoring on ingest and co-locating the raw data and the instant score next to each other in the same row in the same table. This simple structure sets a foundation for easy predictive analytics.
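The idea of scoring on ingest can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the MemSQL pipeline API: the `transform` function, the field names, and the logistic-regression coefficients are all hypothetical. In practice the coefficients would come from an exported model such as a PMML file.

```python
import math

# Hypothetical logistic-regression model with made-up coefficients.
MODEL = {"intercept": -1.0, "weights": {"amount": 0.002, "num_items": 0.1}}

def score(record, model=MODEL):
    """Return a probability-like score for one incoming record."""
    z = model["intercept"] + sum(
        weight * record[feature] for feature, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def transform(records):
    """Ingest-time transform: emit each raw record with its score added."""
    for record in records:
        yield {**record, "score": score(record)}

# Illustrative incoming records (sample data for the example).
incoming = [
    {"id": 1, "amount": 500.0, "num_items": 0},
    {"id": 2, "amount": 50.0, "num_items": 1},
]
table = list(transform(incoming))
```

Each output row carries the raw fields and the score together, so a later query can filter or aggregate on the score without a join.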
Enabling Machine Learning in SQL with Extensibility
The new MemSQL extensibility functions also enable a new approach to machine learning directly in SQL. This can dramatically shorten the gap between data science and production applications as operations occur on the live data, and models can be trained and updated to incorporate and reflect the most recent data.
We recently showcased an example of this with k-means clustering by simply using native SQL and MemSQL. You can see the presentation here on Slideshare.
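The presentation works in SQL. As a language-neutral picture of what k-means computes, here is a minimal Python sketch of the standard algorithm. It is not the code from the presentation, and the sample points and starting centroids are invented for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
        labels = assign(points, centroids)
    return centroids, labels

# Illustrative data: two obvious groups and two starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Both steps are set-oriented: the assignment step picks a minimum distance per point, and the update step is a grouped average. That shape is one reason the algorithm can plausibly be expressed over tables, though the presentation may structure it differently.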
Taking Machine Learning Real-Time
With the new features of MemSQL 6 including extensibility and query performance, we expect more machine learning applications to incorporate MemSQL as the persistent data store.
The MemSQL architecture is well suited to work in conjunction with other machine learning systems and real-time data pipelines. For example, MemSQL includes:
- A distributed, scale-out architecture well-suited to performance and large-scale workloads
- An open-source MemSQL Spark Connector for high-throughput, highly parallel, bidirectional connectivity to Spark
- Native integration with Kafka message queues including the ability to support exactly-once semantics
- Full transactional SQL semantics so you can build production applications for the front lines of your business
Together, we see these capabilities as foundational for real-time machine learning workloads, and we invite you to try the latest version of MemSQL today here.
Published at DZone with permission of Gary Orenstein, DZone MVB. See the original article here.