Deploying Machine Learning and AI in the Real World Using Rendezvous Architecture

Read this article in order to learn more about deploying machine learning and AI in the real world using rendezvous architecture.

There’s Fire Behind That Smoke

Machine learning and AI are getting a lot of press lately, but this is more than just hype. Applications and systems that use ML/AI are generating serious amounts of value, often in very surprising applications. But using these new techniques to build systems doesn’t get you a free pass on basic engineering issues. To get practical and sustainable value, you will have to connect these systems to real problems whose solutions have real business value. To succeed in this, you need a repeatable engineering process that can deploy these ML/AI models reliably.

In many ways, there are strong analogies with software development processes, but with important differences. You have to have version control (but it is different with data), continuous deployment (but this is trickier with models), and automated testing (but that is harder when we are learning a model rather than coding a program).

In general, building ML/AI systems breaks down into three major activities. These are 1) collecting data to train a model with, 2) training the model, and 3) deploying the model. Training the model is what most people and most courses focus on, but collecting data and deploying (and monitoring) models are at least as important. Let’s start with a look at the deployment process.

Rendezvous With Destiny

What are the challenges that need to be met for effective deployment in real business settings?

With ordinary software, you deploy new services after testing them, but you typically only have one version running at a time (except possibly during the upgrade process itself). That doesn’t really work with services based on machine learning for a number of reasons, the chief one being that it is important to monitor the operation of new models and compare them to existing models before committing to them fully. This is traditionally called a champion/challenger process. Even after a new challenger model has been approved as the new champion, it is common to keep the old champion around for a bit “just in case”. It is also common to keep a stable version of the model, known as a canary model, around for a very long time for comparison purposes. As a matter of fact, we don’t just want to keep these challengers, old champions, and such around; we want to keep them actively evaluating all requests so that we can compare how they are doing on every request that we get.

We recently developed an overall architecture called the Rendezvous Architecture that enables exactly this. One of the distinctive features of the rendezvous architecture is a very heavy, and possibly counterintuitive, use of a streaming architecture. The basic idea is to deploy all models as microservices that handle all requests in a streaming style. This lets you have multiple pre-tested models already up and running, ready for deployment (essentially “waiting in the wings”). Here’s how: synchronous requests are received by a proxy, which sends them to a stream of requests (called ‘input’ in Figure 1). All live models read these requests from that stream and write results back to a shared results stream (called ‘scores’ in the figure).

Figure 1: A simplified view of the Rendezvous Architecture showing the rendezvous server selecting one of many candidate results from multiple models.
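The request flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain in-memory lists stand in for the ‘input’ and ‘scores’ streams (a real deployment would use a persistent message stream), and the names Request, Score, and run_models are hypothetical, not part of any published implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Request:
    request_id: str
    payload: float


@dataclass
class Score:
    request_id: str
    model_name: str
    value: float


def run_models(input_stream: List[Request],
               models: Dict[str, Callable[[float], float]]) -> List[Score]:
    """Let every live model read every request and write to one shared scores stream."""
    scores: List[Score] = []
    for request in input_stream:
        for name, model in models.items():
            scores.append(Score(request.request_id, name, model(request.payload)))
    return scores


# Hypothetical models: any callable from payload to score works for this sketch.
models = {
    "champion": lambda x: 2.0 * x,
    "challenger": lambda x: 2.0 * x + 0.1,
}
input_stream = [Request("r1", 1.0), Request("r2", 3.0)]
scores = run_models(input_stream, models)
```

The point of the sketch is that no model is special at this stage: each one sees every request, and each result is tagged with the request ID and the model name so that a later step can choose among them.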

At that point, a server known as the rendezvous server trades off a preference for the champion model versus response time requirements. If a new champion falls over or gets slow, the result from another model can be returned instead. Note that because all models start evaluating as soon as requests arrive, you don’t have to restart evaluation in case of a model failure. A really fast but less accurate model can be included as a backstop, as well, so you always have something to work with.
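One way to picture this trade-off is as a preference order combined with a per-model patience limit. The sketch below is an assumption made for illustration, not the selection policy of any particular rendezvous server: the function name rendezvous, the patience_ms parameter, and the model names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (model_name, arrived_ms, value): a result and how long after the request it arrived.
Arrival = Tuple[str, float, float]


def rendezvous(arrivals: List[Arrival],
               preference: List[str],
               patience_ms: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return the result of the most preferred model that answered within its patience limit."""
    for name in preference:
        for model, arrived_ms, value in arrivals:
            if model == name and arrived_ms <= patience_ms[name]:
                return name, value
    return None  # nothing usable arrived in time


preference = ["champion", "backstop"]
patience_ms = {"champion": 50.0, "backstop": 100.0}

# Champion answers in time, so its result wins even though the backstop was faster.
healthy = rendezvous([("backstop", 5.0, 0.4), ("champion", 30.0, 0.9)],
                     preference, patience_ms)

# Champion is too slow, so the already-computed backstop result is returned instead.
degraded = rendezvous([("backstop", 5.0, 0.4), ("champion", 80.0, 0.9)],
                      preference, patience_ms)
```

Because the backstop result was computed in parallel with the champion, falling back costs no extra evaluation time.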

The effect of using the rendezvous architecture is that you can bring up new models at any time and watch how they behave for as long as you like. You can compare the behavior of challenger models against the behavior of the champion or older models in a completely realistic production environment. Once you are satisfied that you really want the new challenger to step up, you don’t have to change anything except the configuration of the rendezvous server; literally, all that happens is that the new champion’s results are no longer ignored.
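To make the promotion step concrete, here is a minimal sketch in which promoting a challenger is nothing more than a configuration change. The RendezvousConfig class, the select function, and the model names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class RendezvousConfig:
    champion: str  # the model whose results are returned to callers


def select(config: RendezvousConfig, scores: Dict[str, float]) -> Optional[float]:
    """Return the configured champion's score; all other scores are ignored."""
    return scores.get(config.champion)


# Both models have already scored the same request.
scores = {"model_v1": 0.71, "model_v2": 0.78}

before = RendezvousConfig(champion="model_v1")
after = replace(before, champion="model_v2")  # promotion is a config change only
```

Neither model is redeployed or restarted here; the only difference between the two configurations is whose output is no longer ignored.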

The consequence of using a rendezvous architecture is that you get the benefits of continuous deployment, but with some wrinkles to address the special needs imposed by working with machine learning models. The rendezvous architecture also gives you the ability to monitor the operation of models (by comparing them to each other), or to detect changes in the kinds of queries being received (by looking for changes in the distribution of the outputs of the canary over time).
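As a rough illustration of watching the canary's output distribution, the sketch below compares a reference window of canary scores against a recent window using the two-sample Kolmogorov-Smirnov statistic. The choice of statistic is an assumption made here, not something the architecture prescribes, and the sample values are made up for the example.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], recent: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two samples (0.0 = identical, 1.0 = disjoint)."""
    a = sorted(reference)
    b = sorted(recent)
    # The largest gap between two step functions occurs at one of the sample points.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical canary outputs from two time windows.
last_month = [0.10, 0.20, 0.30, 0.40]
this_week_same = [0.10, 0.20, 0.30, 0.40]
this_week_shifted = [0.70, 0.80, 0.90, 0.95]

stable = ks_statistic(last_month, this_week_same)
drifted = ks_statistic(last_month, this_week_shifted)
```

Since the canary itself never changes, a large value suggests that the incoming queries have changed. What counts as "large" depends on the window sizes and the application, so any alert threshold would have to be chosen for the system at hand.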

But Wait, There’s More! (Really)

Even though the rendezvous architecture focuses mostly on the deployment problem, it can help with data collection as well. Since all models get all the same requests, you can add a special model (known as a decoy) that produces no results, but simply archives every request that it sees. This archived data is guaranteed to be a faithful record of what the other models have seen, and getting that exactly correct is a big part of the data collection problem.
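A decoy can be very small. In the sketch below, the Decoy class and the JSON-lines archive format are assumptions chosen for illustration, and an in-memory buffer stands in for durable storage.

```python
import io
import json
from typing import Any, Dict, List, TextIO


class Decoy:
    """Reads the same requests as the real models but produces no score."""

    def __init__(self, archive: TextIO) -> None:
        self.archive = archive

    def handle(self, request: Dict[str, Any]) -> None:
        # Archive exactly what was received, one JSON document per line.
        self.archive.write(json.dumps(request, sort_keys=True) + "\n")


requests: List[Dict[str, Any]] = [
    {"id": "r1", "features": [1.0, 2.0]},
    {"id": "r2", "features": [0.5, 0.1]},
]

archive = io.StringIO()  # stand-in for a file or stream
decoy = Decoy(archive)
for request in requests:
    decoy.handle(request)

# Reading the archive back yields the requests the models saw, in order.
archived = [json.loads(line) for line in archive.getvalue().splitlines()]
```

Because the decoy sits on the same input stream as every other model, the archive needs no separate logging path that could drift out of sync with what the models actually received.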

This means that using rendezvous to solve the deployment problem for ML/AI systems actually goes a good way toward solving the data collection problem as well.

If you are planning to deploy a system that uses machine learning or AI, a rendezvous architecture may well be what you need.

You can read more about the rendezvous architecture, including how to do better model-to-model evaluations, in the short book Machine Learning Logistics by me and Ellen Friedman, provided free online at https://mapr.com/ebooks/machine-learning-logistics/.

ai, machine learning, continuous deployment, continuous integration, monitoring, microservices, artificial intelligence, rendezvous architecture
