5 Machine Learning Trends for 2018 Combined With Apache Kafka Ecosystem
Let's take a look at KSQL, ONNX, AutoML, and ML platforms from Uber and Netflix and see how they're related to each other.
At the OOP 2018 conference in Munich, I presented an updated version of my talk about building scalable, mission-critical microservices with the Apache Kafka ecosystem and deep learning frameworks like TensorFlow, DeepLearning4J, or H2O. I want to share the updated slide deck and discuss a few of the newest trends that I incorporated into the talk.
The main story is the same as in my Confluent blog post about the Apache Kafka ecosystem and machine learning, but this time I focused more on deep learning and neural networks. I also discussed a few innovations in the Apache Kafka ecosystem and ML trends from recent months: KSQL, ONNX, AutoML, and ML platforms from Uber and Netflix. Let's take a look at these interesting topics and how they're related to each other.
KSQL: A Streaming SQL Language on Top of Apache Kafka
According to this article:
"KSQL is a streaming SQL engine for Apache Kafka. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more."
You can write SQL-like queries to deploy scalable, mission-critical stream processing apps (which leverage Kafka Streams under the hood) — definitely a highlight in the Kafka open-source ecosystem.
KSQL and Machine Learning
KSQL is built on top of Kafka Streams and therefore allows you to build scalable, mission-critical services. Machine learning models, including neural networks, can be embedded easily by building a user-defined function (UDF). I am currently preparing an example in which I apply a neural network (more precisely, an autoencoder) to sensor analytics. It detects anomalies, i.e. critical values in the health checks of hospital patients, in real time so that an alert can be sent to the doctor.
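To make the autoencoder idea a bit more concrete, here is a minimal Python sketch of the per-event logic such a UDF could wrap: score each reading by its reconstruction error and flag it when the error exceeds a threshold. This is not the example I am preparing. The names `make_anomaly_udf` and `toy_reconstruct`, the `HEALTHY_PROFILE` values, and the threshold are all made up for illustration, and the "model" is a stand-in that always returns a fixed profile instead of a trained network.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def reconstruction_error(original: Vector, reconstructed: Vector) -> float:
    """Mean squared error between an input and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def make_anomaly_udf(
    reconstruct: Callable[[Vector], Vector], threshold: float
) -> Callable[[Vector], bool]:
    """Wrap a model in a per-event function, the way a UDF would."""

    def is_anomaly(reading: Vector) -> bool:
        return reconstruction_error(reading, reconstruct(reading)) > threshold

    return is_anomaly


# Assumption: made-up "healthy" values (heart rate, systolic, diastolic).
HEALTHY_PROFILE: List[float] = [70.0, 120.0, 80.0]


def toy_reconstruct(reading: Vector) -> Vector:
    """Hypothetical stand-in for a trained autoencoder.

    A real autoencoder compresses and decodes its input; this toy simply
    returns the healthy profile, so unusual readings reconstruct badly.
    """
    return list(HEALTHY_PROFILE)


is_anomaly = make_anomaly_udf(toy_reconstruct, threshold=100.0)

readings = [
    [72.0, 118.0, 79.0],    # close to the healthy profile
    [140.0, 180.0, 110.0],  # far away from it
]
alerts = [is_anomaly(r) for r in readings]
print(alerts)  # [False, True]
```

In a real deployment, the stand-in would be replaced by the trained model, and the function would be called once per record flowing through the stream.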
Let's now talk about some interesting new developments in the machine learning ecosystem.
ONNX: An Open Format to Represent Deep Learning Models
According to their website:
"ONNX is a open format to represent deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them."
This sounds similar to PMML (Predictive Model Markup Language, see this article for more details) and PFA (Portable Format for Analytics), two other standards to define and share machine learning models. However, ONNX differs in a few aspects:
- Focuses on deep learning.
- Has several huge tech companies (AWS, Microsoft, Facebook) and hardware vendors (AMD, NVIDIA, Intel, Qualcomm) behind it.
- Supports many leading open-source frameworks already (TensorFlow, PyTorch, MXNet).
ONNX is already GA in version 1.0 and production-ready (as announced by Amazon, Microsoft, and Facebook in December 2017). There is also a nice getting started guide for different frameworks.
ONNX and the Apache Kafka Ecosystem
Unfortunately, ONNX has no Java support yet. Therefore, there is no support yet for embedding it into the Kafka Streams Java API natively — only via a workaround like doing a REST call or embedding a JNI binding. But I am very sure this is only a matter of time because the Java platform is so important for many enterprises to deploy mission-critical applications.
Right now, you could use Kafka's Java API or other Kafka clients. Confluent provides official clients for several programming languages, e.g. Python and Go, both of which are well suited to machine learning applications, too.
Automated Machine Learning (AKA AutoML)
As stated here:
"Automated machine learning (AutoML) is a hot new field with the goal of making it easy to select different machine learning algorithms, their parameter settings, and the pre-processing methods that improve their ability to detect complex patterns in big data."
With AutoML, you can build analytic models without any knowledge about machine learning. The AutoML implementations use different implementations of decision trees, clustering, neural networks, etc. to build and compare different models out-of-the-box. You just upload or connect your historical dataset and click a few buttons to start the process — maybe not perfect for every use case, but you can easily improve many existing processes without the need for a rare and expensive data scientist.
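The "build and compare different models" step can be sketched in a few lines of Python. This is a toy illustration of the idea, not how any particular AutoML product works: the three candidate models, the `auto_ml` helper, and the tiny dataset are all assumptions made up for this sketch. Real tools also search over parameter settings and pre-processing steps.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]
Fitter = Callable[[List[float], List[float]], Model]
Dataset = Tuple[List[float], List[float]]


def fit_mean(xs: List[float], ys: List[float]) -> Model:
    """Baseline: always predict the mean of the training targets."""
    avg = mean(ys)
    return lambda x: avg


def fit_linear(xs: List[float], ys: List[float]) -> Model:
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_nearest(xs: List[float], ys: List[float]) -> Model:
    """1-nearest neighbor: copy the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model: Model, xs: List[float], ys: List[float]) -> float:
    """Mean squared error of a model on a dataset."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def auto_ml(
    train: Dataset, test: Dataset, candidates: Dict[str, Fitter]
) -> List[Tuple[str, float]]:
    """Fit every candidate on train and rank by held-out error (best first)."""
    board = [
        (name, mse(fit(train[0], train[1]), test[0], test[1]))
        for name, fit in candidates.items()
    ]
    return sorted(board, key=lambda row: row[1])


# Made-up data that follows y = 2x + 1 exactly.
train = ([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0])
test = ([0.5, 2.5, 5.0], [2.0, 6.0, 11.0])

leaderboard = auto_ml(
    train, test, {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
)
for name, error in leaderboard:
    print(f"{name}: {error:.2f}")
```

On this data, the linear model ends up at the top of the leaderboard because the data is perfectly linear. The point is only that the ranking comes from the held-out error, not from any ML expertise on the user's side.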
DataRobot and Google's AutoML are two of many well-known cloud offerings in this space. H2O's AutoML is integrated into its open-source ML framework but they also offer a nice UI-focused commercial product called Driverless AI. I highly recommend spending 30 minutes on any AutoML tool. It is really fascinating to see how AI tools develop these days.
AutoML and the Apache Kafka Ecosystem
Most AutoML tools offer deployment of their models. You can then access the analytic models, e.g. via a REST interface, which is not a perfect solution for a scalable, event-driven architecture like Kafka. The good news: many AutoML solutions also allow you to export their generated models so that you can deploy them in your own application. For example, in H2O's open-source framework, AutoML is just one of many available options. You simply call a different operation in the programming language of your choice (R, Python, Scala, Web UI):
aml <- h2o.automl(x = x, y = y, training_frame = train, leaderboard_frame = test, max_runtime_secs = 30)
This is similar to what you would do to build a linear regression solution, decision tree, or neural network. The result is generated Java code that you can easily embed into your Kafka Streams microservice or any other Kafka application. AutoML enables you to build and deploy highly scalable machine learning without deep knowledge of ML.
ML Platforms: Uber's Michelangelo; Netflix's Meson
Tech giants are typically some years ahead of "traditional enterprises." They already built years ago what you would build today or tomorrow. ML platforms are no different. Writing the ML source code to train an analytic model is just a very small part of a real-world ML infrastructure. You need to think about the whole development process. The following picture shows the hidden technical debt in machine learning systems:
You will probably build several analytic models with different technologies. Not everything will be built in your Spark or Flink cluster or in a single cloud infrastructure. You might run TensorFlow on some big, expensive GPU in the public cloud to build powerful neural networks. Or you might use H2O to build some small but very efficient and performant decision trees that do inference in a few microseconds. ML has many use cases.
That's why many tech giants have built their own ML platforms, like Uber's Michelangelo or Netflix's Meson. These ML platforms allow them to build and monitor powerful, scalable analytic models while remaining free to choose the right ML technology for each use case.
Apache Kafka Ecosystem for ML Platforms
One of the reasons why Apache Kafka is so successful is its huge adoption by many tech giants. Almost all great Silicon Valley companies, including LinkedIn, Netflix, Uber, and eBay, speak about their usage of Kafka as an event-driven central nervous system for their mission-critical applications. Many focus on the distributed streaming platform for messaging. But we also see more and more adoption of add-ons like Kafka Connect, Kafka Streams, REST Proxy, Schema Registry, and KSQL.
Look at the above picture again and think about Kafka: Isn't it a perfect fit for an ML platform? It can serve as the backbone for training, monitoring, deployment, inference, configuration, A/B testing, etc. That's probably why Uber, Netflix, and many others already use Kafka as a central component in their ML infrastructure.
And again, you are not forced to use just one specific technology. One of the great design concepts of Kafka is that you can re-process data again and again from its distributed commit log. This means you can either build different models with one technology as a Kafka sink (let's say Apache Flink or Spark) or connect different technologies like scikit-learn for local testing, TensorFlow running on Google Cloud GPUs for powerful deep learning, an on-premise installation of H2O nodes for AutoML, and some other Kafka Streams ML apps deployed in Docker containers or Kubernetes. All of these ML applications consume the data in parallel at their own pace and however often they need to.
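The re-processing idea is easy to model in plain Python. The sketch below is a deliberately simplified, in-memory imitation of a commit log, not Kafka's client API: `CommitLog`, `Consumer`, and the event names are hypothetical. It only shows why independent offsets let several ML applications read the same data at their own pace and replay it when needed.

```python
from typing import List


class CommitLog:
    """Append-only log: reading never removes records."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset: int, max_records: int) -> List[str]:
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own offset, so consumers do not affect each other."""

    def __init__(self, log: CommitLog, name: str) -> None:
        self.log = log
        self.name = name
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[str]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek_to_beginning(self) -> None:
        self.offset = 0


log = CommitLog()
for i in range(5):
    log.append(f"event-{i}")

trainer = Consumer(log, "model-training")  # e.g. a batch training job
scorer = Consumer(log, "model-scoring")    # e.g. a streaming inference app

first_training_run = trainer.poll()         # reads all five events at once
scoring_batch = scorer.poll(max_records=2)  # reads only the first two

trainer.seek_to_beginning()                 # retrain on the same data
second_training_run = trainer.poll()

print(first_training_run == second_training_run)  # True
print(scoring_batch)  # ['event-0', 'event-1']
```

Both consumers see the same records, neither slows the other down, and rewinding the training consumer yields exactly the same events again.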
Here is a great example of how to automate training and deployment of a scalable ML microservice with Kafka and Kafka Streams. There's no need to add another big data cluster. That's one of the key differences of using Kafka Streams or KSQL for your ML applications instead of other stream processing frameworks.
Apache Kafka and Deep Learning: Slide Deck From OOP
Finally, after all these discussions about the Apache Kafka ecosystem and new trends in machine learning and deep learning, here are the updated slides from my talk at the OOP 2018 conference:
I have also built a few examples using Apache Kafka, Kafka Streams and different open-source ML frameworks like H2O, TensorFlow, and DeepLearning4j (DL4J). The GitHub project shows how easy it is to deploy analytic models to a highly scalable, fault-tolerant, mission-critical Kafka microservice. A KSQL demo will also come soon.
Please share your feedback. Do you already use Kafka in the machine learning space? What components in addition to Kafka core do you use? Feel free to contact me to discuss this in more detail.
Published at DZone with permission of Kai Wähner, DZone MVB. See the original article here.