The Fundamental Differences Between ML Model Development and Traditional Enterprise Software Development
Academic literature on machine learning modeling does not explicitly address how enterprises across industries can utilize ML algorithms. And many companies, even after investing in foundational ML tools, still struggle to define business use cases for their AI apps, customize general-purpose machine learning models for domain-specific tasks, convert business requirements into data requirements, and so on.
In this post, we’ll talk about key differences between traditional enterprise software development and ML model building and offer some ML lifecycle management tips (chiefly concerning data preparation and feature engineering) for those seeking to harness AI.
Enterprise Software Development vs. ML Model Building
In traditional software development we write out explicit instructions for a computer to follow and, therefore, the applications we end up with are deterministic.
In machine learning, which is probabilistic in nature, we rely on data to write our if-then statements. We feed data to our algorithms to get our models to find underlying patterns within datasets and figure out formulas for mapping inputs to outputs.
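To make the contrast concrete, here is a minimal sketch (plain Python, with hypothetical toy data) of a model recovering a mapping from examples rather than from hand-written rules:

```python
# Instead of hard-coding "y = 2x + 1", we let the data reveal the formula.
# Ordinary least squares on a toy dataset (the numbers are illustrative).

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]  # noisy samples of y ~ 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form OLS estimates for slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    return slope * x + intercept
```

The "program" (slope and intercept) was never written by hand; it was estimated from data, which is the essential difference from a deterministic if-then application.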
The data management pipeline in ML typically consists of acquiring a dataset, checking its quality, cleaning and sampling it, and, in some cases, applying augmentation procedures such as data synthesis. These stages all have their own lifecycles and to execute them properly companies need disciplined management; they need to integrate specific tools (for data governance, data lineage management, etc.) with the big data solutions they’re already using so as not to end up with chaotic models.
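The stages above can be sketched as composable steps; the stage names, records, and validity rules below are illustrative assumptions, not a prescribed pipeline:

```python
# A toy data-management pipeline: acquire -> quality check -> sample.
import random

def acquire():
    # Stand-in for pulling records from a warehouse or a vendor feed.
    return [{"age": 34, "income": 52000},
            {"age": None, "income": 48000},   # missing value
            {"age": 29, "income": -1},        # implausible value
            {"age": 41, "income": 61000}]

def quality_check(rows):
    # Keep only records that pass basic validity rules.
    return [r for r in rows if r["age"] is not None and r["income"] >= 0]

def sample(rows, k, seed=0):
    # Reproducible subsampling for a manageable training set.
    return random.Random(seed).sample(rows, min(k, len(rows)))

dataset = sample(quality_check(acquire()), k=2)
```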
How Enterprises Can Make Machine Learning Work for Them
Here’s a brief introduction to AI service development steps, which are all iterative and require continuous improvement.
#1 Defining Goals
After the problem is clearly defined, and we’re certain that ML techniques can help us resolve it, data scientists will kick things off by proposing several model design options to experiment with.
We’ll start with an explicit description of what the model is expected to do and try to translate, as specifically as possible, business requirements into data requirements.
Having focused and measurable objectives will help put everyone on the same page; your data scientists will understand better what types of data to acquire and how to set train/dev/test splits, etc.
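A reproducible split is one of the first concrete decisions that follows from those objectives. A minimal sketch, assuming an 80/10/10 ratio (the ratios and seed are illustrative):

```python
# Deterministic train/dev/test split so experiments stay comparable.
import random

def split_dataset(rows, train=0.8, dev=0.1, seed=42):
    rows = rows[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)   # fixed seed => reproducible split
    n = len(rows)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (rows[:n_train],
            rows[n_train:n_train + n_dev],
            rows[n_train + n_dev:])

train_set, dev_set, test_set = split_dataset(list(range(100)))
```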
#2 Preparing Data
The goal of this step is to obtain a dataset that’s diverse, representative, and unbiased.
It’s a good idea to have a person within the firm whose responsibility is identifying suitable sources of data, ensuring the legality of data, and handling negotiations with data vendors.
The processes of data selection and data cleansing, which are essential to proper data preparation, might substantially reduce your initial dataset, but they'll also help prevent your models from picking up (and amplifying) undesirable biases from data.
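A minimal sketch of why cleansing shrinks a dataset, using hypothetical records and toy rules (drop duplicates and rows with missing fields):

```python
# Data cleansing typically reduces the raw dataset:
# here, 4 raw records become 2 clean ones.

raw = [
    {"id": 1, "label": "cat"},
    {"id": 1, "label": "cat"},      # exact duplicate
    {"id": 2, "label": None},       # missing label
    {"id": 3, "label": "dog"},
]

seen = set()
clean = []
for row in raw:
    key = (row["id"], row["label"])
    if row["label"] is None or key in seen:
        continue                    # discard incomplete or duplicate rows
    seen.add(key)
    clean.append(row)
```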
As far as annotation goes, consider seeking help from outside; properly labeling each and every data object in your dataset when all you have is a small internal team of human annotators might be too overwhelming a task. Pre-annotation (using machine learning) can substantially speed up the process.
In some cases, when real-world data is difficult to obtain, synthetic augmentation might be implemented to expand the dataset. If we're talking about image recognition, for instance, rotating and tilting pictures or changing their colors allows us to create new training samples.
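The rotation trick can be sketched with a tiny 2x3 grid standing in for an image (values and dimensions are illustrative):

```python
# Synthetic augmentation: each 90-degree rotation of a training image
# yields a new, equally valid sample.

def rotate90(image):
    # Rotate a 2D grid of pixel values clockwise.
    return [list(col) for col in zip(*image[::-1])]

original = [[1, 2, 3],
            [4, 5, 6]]

# One original plus three rotations = 4x the training samples.
augmented = [original]
for _ in range(3):
    augmented.append(rotate90(augmented[-1]))
```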
#3 Representing Data
After selecting, cleaning, transforming and enriching our data, we can start working on feature selection and feature engineering.
Feature selection allows us to figure out which features in the training data are the most useful (for the task at hand) and get rid of the unnecessary ones. Having fewer features helps simplify the learning problem and allows us to avoid the curse of dimensionality.
Generally, the algorithms used to tackle the selection problem do one of the following things:
Filtering — when we input a set of features into a search algorithm that maximizes some criterion and outputs a reduced set of features, without interacting with the learner (the usefulness criterion is built into the search algorithm).
Wrapping — when a search algorithm sends the set of features we’ve put into it to a learning algorithm, which then reports back on their performance. In this case, the criterion is built into the learner and the two algorithms work in tandem.
Wrapping methods tend to perform better, but they are far more computationally expensive. Filtering algorithms, on the other hand, generally can't provide highly accurate results, but they are much cheaper to compute.
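A filter-style selector can be sketched in a few lines; here the (illustrative) criterion is variance, so the learner is never consulted:

```python
# Filter-based feature selection: rank features by variance and keep
# the top k, without ever training a model.

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def filter_select(columns, k):
    # columns: {feature_name: [values...]}
    ranked = sorted(columns, key=lambda name: variance(columns[name]),
                    reverse=True)
    return ranked[:k]

features = {
    "constant": [5.0, 5.0, 5.0, 5.0],   # zero variance -> uninformative
    "noisy":    [1.0, 9.0, 2.0, 8.0],
    "mild":     [4.0, 5.0, 4.5, 5.5],
}
kept = filter_select(features, k=2)
```

A wrapper would replace `variance` with the score reported by the learning algorithm itself, which is why it costs a full training run per candidate feature set.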
Feature engineering, broadly speaking, refers to generating new features from the existing ones (extracted from raw data) to enhance the model’s performance. In other words, it’s a process of applying thorough domain knowledge to find creative ways of representing the problem clearly to the predictive model (for example, instead of the raw variables a and b we might use log(a) - sqrt(b)).
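The log(a) - sqrt(b) example from above, sketched as a transformation step (the input values are illustrative):

```python
# Engineering a composite feature log(a) - sqrt(b) from raw variables.
import math

def engineer(row):
    a, b = row["a"], row["b"]
    return {**row, "log_a_minus_sqrt_b": math.log(a) - math.sqrt(b)}

rows = [{"a": math.e, "b": 4.0},   # log(e) - sqrt(4) = 1 - 2 = -1
        {"a": 1.0,    "b": 9.0}]   # log(1) - sqrt(9) = 0 - 3 = -3
enriched = [engineer(r) for r in rows]
```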
The main benefit of using expert-crafted features is that they can help even light, simple models produce good results; they help the models converge faster and thus save us both time and money.
However, the approach does have some drawbacks too:
- We need experts who understand how our features interact with each other;
- The task, which often includes thousands of experiments to test various hypotheses, might turn out extremely time-consuming;
- Hand-crafted features are brittle and typically do not scale well in practical situations; they can only be applied in a specific context.
#4 Model Training and Testing
Next, we move on to evaluating different ML models to find out which one makes the most sense to use considering our target variable. This is where we make decisions about which frameworks to use (PyTorch, Keras, etc.) and experiment with various architectures. If deep learning is involved, we have to decide how many hidden layers our neural net will have and what kind of pooling operations and activation functions we will put into it.
During training, it’s quite common to apply a statistical method called K-Fold cross-validation to detect the best-performing model and avoid overfitting. The gist of the procedure is segmenting the training set into k groups (called folds) and holding one of them out during each training run: the model trains on the remaining folds and makes predictions on the held-out fold as if it were a test set. We repeat this k times, holding out a different fold each time, and then estimate the error by averaging the results over all trials.
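The procedure above can be sketched from scratch; `train_and_score` is a hypothetical callback standing in for whatever training-and-evaluation routine the model uses:

```python
# K-fold cross-validation: each fold serves as the held-out validation
# set exactly once; errors are averaged over the k trials.

def k_fold_indices(n, k):
    # Partition indices 0..n-1 into k near-equal contiguous folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    folds = k_fold_indices(len(data), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(train_and_score(train, test))
    return sum(scores) / k   # average error across all k trials
```

In practice one would shuffle the data before folding; the sketch keeps folds contiguous for clarity.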
After training, our finalized model will be tested on various datasets and its performance, in terms of runtime and output quality, will be compared against competitors’ services (if available). Every failed use case will be analyzed thoroughly so that we know how to reduce errors during future iterations.
#5 Deploying the Model
Up to this point, we’ve been investing time and resources into building and testing our model, without returns. Now, we can finally deploy it into production and make it useful for the enterprise. The data scientists and the application development team have to work together on this to find an optimal deployment configuration (infrastructure elements, memory, disk, GPU/CPU power, etc.). The end goal is integrating the ML model into your organization's infrastructure in a way that will allow other applications to query it for predictions without undue delays.
While some companies prefer to translate ML models, which are mostly written in Python, to C++ (for example) and redeploy them to ensure seamless communication, rewriting everything can take weeks, which creates a sizable gap between the testing and deployment phases. A speedier approach is deploying the model as a Python application serving a REST API.
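The core of such a service can be sketched framework-agnostically; in production the handler below would be mounted behind a REST framework such as Flask or FastAPI, and the "model" here is a hypothetical stand-in:

```python
# The prediction endpoint's core logic: JSON in, JSON out.
import json

def load_model():
    # Stand-in for deserializing a trained model artifact from disk.
    return lambda features: sum(features)   # toy "model"

MODEL = load_model()

def handle_predict(request_body: str) -> str:
    """Parse a JSON request, run the model, return a JSON response."""
    payload = json.loads(request_body)
    prediction = MODEL(payload["features"])
    return json.dumps({"prediction": prediction,
                       "model_version": "1.0.0"})
```

For example, `handle_predict('{"features": [1, 2, 3]}')` returns a JSON string carrying the prediction and the model version, which any application in the organization can consume over HTTP.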
It’s also recommended to save the model into a database to preserve its lineage; we must record the time of deployment and the model version.
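A minimal lineage record can be kept with the standard library; the schema, version string, and artifact path below are illustrative assumptions:

```python
# Recording model lineage: version, deployment time, and artifact path.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")   # use a persistent DB in production
conn.execute("""CREATE TABLE model_registry (
                    version TEXT PRIMARY KEY,
                    deployed_at TEXT NOT NULL,
                    artifact_path TEXT NOT NULL)""")

def register_deployment(version, artifact_path):
    # Stamp every deployment with a UTC timestamp for auditability.
    conn.execute("INSERT INTO model_registry VALUES (?, ?, ?)",
                 (version,
                  datetime.now(timezone.utc).isoformat(),
                  artifact_path))
    conn.commit()

register_deployment("1.0.0", "models/model-1.0.0.pkl")
```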
The ML lifecycle is unique, and the typical guidelines we use in enterprise software engineering, such as SDLC, CMM, and ALM, can’t help us efficiently manage the development of AI services.
Companies seeking to capitalize on AI need to adopt new tools, processes, and skill sets; they must invest in foundational data acquisition and data governance technologies, and hire teams capable not only of training, testing, and deploying ML models but also of handling their continuous improvement.