Machine Learning in Software Development — Techniques and Tools

The ability to version-control ML models, automate testing, and provide better feedback.

Tom Smith

Sep. 26, 19 · Analysis

To learn about the current and future state of machine learning (ML) in software development, we gathered insights from IT professionals from 16 solution providers. We asked, "What machine learning techniques and tools are most effective for the SDLC?" Here's what we learned:

Tools

MLflow, Bugspots, Helium, and Appvance are powerful tools. I particularly like MLflow for its ease of use and its ability to version-control ML models.
We adopted MLflow for our data platform, an ML data platform management system built on an operational database that is real-time and transactional and supports in-database ML, to track the workflow of the data scientists. If you adopt a culture of experimentation and create 50 experiments a day, each producing a different result, you need to keep track of every one. You need the ability to tag runs with parameters and metrics so you can go back and see why one model performed better than another.
We’re building those tools as part of our platform. We use open source tools like scikit-learn, PyTorch, and TensorFlow, and we also build our own.
Many modern test automation tools offer self-healing tests, automated tests, and automated crawlers that find bugs. Logging systems find anomalies for security alerts. Most of the focus is on maintenance.
Tools simplify infrastructure and data engineering for developers. With ML, an explosion of things needs to happen, so easy integration into the application matters. Debugging is more difficult because ML models are living entities, and drift occurs as data and learning change. The biggest challenge is the debuggability of the code and the application. Make sure you have traceability of your model decisions and evaluate model performance over time.
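
To make the point about tagging experiments with parameters and metrics concrete, here is a minimal sketch of the idea in plain Python. It is not MLflow's API: `Run`, `ExperimentLog`, and the parameter and metric values are hypothetical names and numbers chosen for illustration, and a real tracker would persist runs rather than hold them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run with its tagged parameters and resulting metrics."""
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker (illustrative only, not a real library)."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        # Copy the dicts so later changes by the caller do not alter the record.
        run = Run(run_id=len(self.runs) + 1, params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric):
        """Return the run with the highest value for the given metric."""
        return max(self.runs, key=lambda r: r.metrics[metric])

    def diff_params(self, a, b):
        """Return the parameters whose values differ between two runs."""
        keys = set(a.params) | set(b.params)
        return {
            k: (a.params.get(k), b.params.get(k))
            for k in keys
            if a.params.get(k) != b.params.get(k)
        }


# Illustrative values only: two runs that differ in one parameter.
log = ExperimentLog()
first = log.log_run({"learning_rate": 0.1, "max_depth": 3}, {"accuracy": 0.81})
second = log.log_run({"learning_rate": 0.1, "max_depth": 6}, {"accuracy": 0.86})

winner = log.best("accuracy")
changed = log.diff_params(first, winner)
print(winner.run_id, changed)
```

Comparing the parameters of the best run against an earlier one is what lets you answer why one model performed better than another, which is hard to reconstruct after the fact if runs were never tagged.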

Feedback

The most effective technique is to define the task at hand as clearly as possible and immediately come up with an automatic evaluation method. After that, collect and label a small dataset for your problem, overfit to that dataset with any method, and try to close the whole production loop: dataset collection, training, evaluation, deployment. Most of the time you’ll realize that your evaluation method does not measure what you intended for your product, so you have to go through these stages again.
The answer for everything is DevOps, but a better answer is to think in terms of providing useful feedback loops. We tend to focus on ceremony and mechanics without instrumenting ops in a way that gives developers value from the metrics. To prevent analysis paralysis, include ML at the ops level to give developers the information they need. You want to see anomaly rates that diverge from projections. Build anomaly detection models based on code. Ops is creating better feedback data for developers.
Python is the default language for scripting the frameworks. There are a lot of models that can be used, or you can build your own. Reinforcement learning (deep adversarial, Q), semi-supervised learning, and closed-loop ML techniques have proven beneficial in different phases of the SDLC. When organizations build models, the underlying premise is that the model’s accuracy and efficiency rest on certain assumptions and depend on the training data set it has seen. If data patterns change or unanticipated scenarios arise, the model’s accuracy and efficiency may diminish over time. For example, in a manufacturing plant, a model can be deployed to detect defects in parts being manufactured and assembled on the assembly line. Over time, the model’s ability to identify the defects accurately may diminish. This creates severe challenges if the software relies exclusively on traditional analytics. However, when equipped with closed-loop functionality, smart agents can detect the degradation and trigger a re-learning and re-training process that improves the accuracy and performance of the models automatically, leading to increased productivity, efficiency, and cost savings. The closed-loop ML technique for the SDLC can use reinforcement or unsupervised algorithms to train, test, and validate ML models to improve accuracy. After the initial deployment, the model can self-learn, self-adjust, and detect variations in its own accuracy and performance as needed. In short, it tunes itself so that the output stays optimal.
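
The closed-loop idea described above can be sketched as a monitor that watches accuracy over a rolling window and triggers retraining when it falls below a threshold. This is a minimal illustration under stated assumptions, not any vendor's implementation: `ClosedLoopMonitor`, the `retrain` callback, the window size, and the threshold are all hypothetical, and it assumes ground-truth labels eventually become available for deployed predictions.

```python
from collections import deque


class ClosedLoopMonitor:
    """Hypothetical monitor: tracks rolling accuracy and triggers retraining."""

    def __init__(self, retrain, window=50, min_accuracy=0.9):
        self.retrain = retrain                # callback that retrains the model
        self.outcomes = deque(maxlen=window)  # True/False per recent prediction
        self.min_accuracy = min_accuracy
        self.retrain_count = 0

    def accuracy(self):
        """Fraction of correct predictions in the window (1.0 if empty)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, prediction, actual):
        """Record one labeled outcome and retrain if the full window degrades."""
        self.outcomes.append(prediction == actual)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.accuracy() < self.min_accuracy:
            self.retrain()
            self.retrain_count += 1
            self.outcomes.clear()  # start a fresh window for the new model


# Illustrative usage: the callback only records that retraining was requested.
events = []
monitor = ClosedLoopMonitor(
    retrain=lambda: events.append("retrain"), window=10, min_accuracy=0.8
)

for _ in range(10):
    monitor.record("defect", "defect")  # model is correct
for _ in range(3):
    monitor.record("ok", "defect")      # model starts missing defects

print(monitor.retrain_count, events)
```

Clearing the window after a retrain is a design choice in this sketch: it avoids judging the refreshed model on the failures of the old one.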

Other

ML is becoming standardized across the SDLC: people are learning how to use it and gaining visibility into where things are going, and ML is becoming more distributed.
We're seeing more around deep learning and specific ML methods.
It depends on the business case. Classic data science is needed to understand the right algorithm and ensure data management. You may need to choose a model that’s almost as good but computationally less expensive. Incorporate a desirability function to consider the cost of planning and deployment.
Techniques I am seeing include learning techniques such as concept learning, decision trees, neural networks (and convolutional neural networks), if/then rules, reinforcement learning, inductive logic programming, and the like.
Here are the main elements:
- 1) Ensuring business requirements and expectations are set from the beginning. This helps define the ROI for the project and what you’re looking to solve for (e.g., better customer engagement, reduced churn, etc.).
- 2) Converting the business problem into a technical problem. This lets you define what data is needed, the approach, where to start, etc. so you can set the scope of the solution. You take the business problem of improving customer satisfaction or gaining market share and turn it into a data science problem: prediction of customer conversion or customer churn, user segmentation, product recommendation, etc., which is something you can solve using data and a model.
- 3) Establishing what data is actually available to solve the problem. This can be one of the biggest limiting factors in applying ML in the SDLC. There needs to be sufficient and relevant data to solve the problem, and there needs to be a base level of normalization. Given the technical problem, you need to identify which entities can be relevant features to plug into the model.
- 4) Designing the iteration process. Given your toolkit, start with the simplest approach possible and see how it performs. Based on those results, you have a sense of direction for where to go and how to add complexity.
- 5) Experimentation and quality. Design experiments so you can test performance, make modifications, re-evaluate, then rinse and repeat. Make sure you pick the right metrics, so you measure what really matters.
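
The earlier suggestion in this section, choosing a model that’s almost as good but computationally less expensive by way of a desirability function, can be sketched as follows. Everything here is an assumption for illustration: the linear form of `desirability`, the `cost_weight` parameter, the candidate names, and the accuracy and cost figures are made up, and a real project would define cost and weighting from its own business case.

```python
def desirability(accuracy, cost, cost_weight):
    """Hypothetical score: reward accuracy, penalize deployment cost linearly."""
    return accuracy - cost_weight * cost


def pick_model(candidates, cost_weight):
    """Return the name of the candidate with the highest desirability score."""
    return max(
        candidates,
        key=lambda name: desirability(
            candidates[name]["accuracy"], candidates[name]["cost"], cost_weight
        ),
    )


# Made-up numbers: accuracy on a validation set and relative compute cost.
candidates = {
    "deep_net": {"accuracy": 0.93, "cost": 10.0},
    "gradient_boosting": {"accuracy": 0.91, "cost": 1.5},
    "logistic_regression": {"accuracy": 0.85, "cost": 0.2},
}

accuracy_only = pick_model(candidates, cost_weight=0.0)
cost_aware = pick_model(candidates, cost_weight=0.01)
print(accuracy_only, cost_aware)
```

With the cost weight at zero the most accurate model wins. Once cost carries even a small weight, the slightly less accurate but much cheaper model comes out ahead, which is the trade-off the respondent describes.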

Here’s who we heard from:

Dipti Borkar, V.P. Products, Alluxio
Adam Carmi, Co-founder & CTO, Applitools
Dr. Oleg Sinyavskiy, Head of Research and Development, Brain Corp
Eli Finkelshteyn, CEO & Co-founder, Constructor.io
Senthil Kumar, VP of Software Engineering, FogHorn
Ivaylo Bahtchevanov, Head of Data Science, ForgeRock
John Seaton, Director of Data Science, Functionize
Irina Farooq, Chief Product Officer, Kinetica
Elif Tutuk, AVP Research, Qlik
Shivani Govil, EVP Emerging Tech and Ecosystem, Sage
Patrick Hubbard, Head Geek, SolarWinds
Monte Zweben, CEO, Splice Machine
Zach Bannor, Associate Consultant, SPR
David Andrzejewski, Director of Engineering, Sumo Logic
Oren Rubin, Founder & CEO, Testim.io
Dan Rope, Director, Data Science and Michael O’Connell, Chief Analytics Officer, TIBCO

Opinions expressed by DZone contributors are their own.
