QA: Why Machine Learning Systems Are Non-Testable

In this post, you will learn why Machine Learning systems/models are considered non-testable.

This post presents views on why Machine Learning systems or models are termed non-testable from a quality control/quality assurance perspective. Before I proceed, let me humbly acknowledge that data scientists and the Machine Learning community regard ML models as testable: they are first trained and then tested, using techniques such as cross-validation to improve and optimize model performance. However, "testing" in that sense refers to the development (model building) phase, when data scientists evaluate model performance by comparing the model outputs (predicted values) with the actual values. This is not the same as testing the model on an arbitrary input for which the expected output is not known beforehand. In this post, I am instead talking about the testability of ML models from the perspective of traditional software testing.

Given that Machine Learning systems are non-testable, performing QA or quality control checks on them is not easy, which is a matter of concern given the trust that end users need to place in such systems. Project stakeholders need to understand the non-testability aspects of Machine Learning systems in order to put appropriate quality controls in place and serve trustworthy Machine Learning models to end users in production. This applies especially to healthcare and financial systems, where even a couple of false negatives (type II errors) could cause serious trouble for the stakeholders.

This, in a way, presents an opportunity for the AI/Data Science/Machine Learning community to create frameworks that enable testability of ML systems from a QA perspective. As a matter of fact, frameworks such as LIME could play a key role in achieving testability of ML systems. I will be digging deeper and posting my research work in this field in the next 3-6 months. Stay tuned!


A software application needs to go through many quality control (QC) checks (testing) as part of quality assurance (QA) practices before it is moved into production for end users. QC checks are comparatively easy to perform on conventional software applications because the outputs for different classes of inputs can be verified against expected values that are known before testing starts. The mechanism that performs this verification is termed a test oracle, which we will discuss later in this post.

For Machine Learning (ML) systems comprising Machine Learning/predictive models, there are no well-defined expected values against which the outputs can be verified and declared correct or incorrect. In other words, the test oracle is not clearly defined for ML systems.

This is why Machine Learning systems can be termed non-testable. We will see the details later in this post.

What Is a Test Oracle?

In software testing, a frequently invoked assumption is that there are testers or external mechanisms, such as automated software tests (unit tests/integration tests), that can accurately determine whether or not the output produced by the program is correct. These testers or automated tests are termed an oracle or test oracle. The assumption that such a mechanism can accurately determine program correctness from input-output pairs is called the oracle assumption.

A software/program can be termed as non-testable in the following scenarios:

  • A test oracle (testers/test programs) does not exist because the correctness of the program output cannot be verified against an expected value, possibly because the expected values are not well defined in the first place.
  • The testers must expend an extraordinary amount of time and effort to determine whether or not the output is correct, or the test programs become very costly to execute and maintain when program correctness must be checked at regular intervals.

A tester or test mechanism that can state whether the program output is correct or incorrect without knowing the exact correct answer is termed a partial oracle.
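
The difference between a full oracle and a partial oracle can be illustrated with a minimal Python sketch. The functions `apply_discount` and `softmax` are hypothetical stand-ins invented for this illustration and are not part of the original post. The full oracle compares an output with a value known in advance. The partial oracle only rejects outputs that cannot possibly be correct, which is one common way such a check is realized.

```python
import math


def apply_discount(price, percent):
    """Conventional program: the expected value is known in advance."""
    return round(price * (1 - percent / 100), 2)


def softmax(scores):
    """Stand-in for a model scoring step: raw scores to probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_oracle_passes():
    # Full oracle: compare the output with a value known before testing.
    return apply_discount(200.0, 25) == 150.0


def partial_oracle_passes(probabilities):
    # Partial oracle: the exact correct probabilities are unknown, but any
    # correct output must lie in [0, 1] and sum to 1.
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    sums_to_one = math.isclose(sum(probabilities), 1.0, rel_tol=1e-9)
    return in_range and sums_to_one


if __name__ == "__main__":
    print(full_oracle_passes())
    print(partial_oracle_passes(softmax([2.0, 1.0, 0.1])))
    print(partial_oracle_passes([0.7, 0.7]))  # cannot be a valid distribution
```

Note that passing the partial check does not prove the probabilities are right. It only shows that they are not obviously wrong.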

Why Are ML Models/Systems Non-testable?

As defined earlier, a program is said to be non-testable in the absence of a test oracle, that is, the testers or test mechanisms that could be used to verify the correctness of program outputs.

Machine Learning programs or models fall under the category of non-testable programs. The following points explain why:

  • Unlike traditional software apps, where outputs in the form of expected values are known beforehand, the outputs of Machine Learning models are predictive in nature. There are no expected values beforehand; rather, the output is predicted when the model is executed on a given set of input values. Only experts can tell whether the prediction made by the model for a given set of input values is correct or not.
  • Machine Learning models built with a specific algorithm can give improved results once they are optimized with techniques such as cross-validation or grid search. This makes testing difficult because the same set of input values fed into the optimized model could give different outputs. Consider an example where classifiers are built using different algorithms to predict the quality of red wine, taken from a Kaggle project on predicting wine quality. Pay attention to the following results, which reflect the non-testability of the models given the different outputs possible with optimized models:
    • A support vector classifier model was built to classify the quality of the red wine. The precision was found to be 0.86 and the recall 0.88. After the model was optimized using grid search, the precision was 0.90 and the recall 0.90.
    • A random forest classifier model was trained to classify the quality of the wine. The precision and recall were found to be 0.87 and 0.88 respectively. After the classifier was optimized with cross-validation, the reported accuracy (precision value) improved to 91%.
    These results illustrate the challenge that testers (the oracle) face in determining the correctness of a model: the non-optimized and the optimized versions of the same model produce different outputs.
  • Machine Learning models built with different algorithms give different results, depending on the accuracy of each model. In the same example, models built with random forest, stochastic gradient descent, and support vector classifier achieved precision values of 87%, 84%, and 86% respectively. This makes it hard for a test oracle to determine the correctness of the outcome, because the same set of input values fed into models built with different algorithms could give different output values (predictions).
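
The last point can be made concrete with a minimal sketch. It does not use the Kaggle wine data or the classifiers named above. Instead, it assumes two toy models (a nearest-neighbor rule and a nearest-centroid rule) trained on made-up points, purely to show that two reasonable models can disagree on the same input, leaving a tester without an obvious expected value.

```python
import math

# Made-up training data for illustration only: (feature vector, label).
TRAINING = [
    ((1.0, 1.0), "low"), ((2.0, 2.0), "low"), ((6.0, 6.0), "low"),
    ((7.0, 7.0), "high"), ((8.0, 8.0), "high"), ((9.0, 9.0), "high"),
]


def nearest_neighbor_predict(x):
    """Model A: label of the single closest training point."""
    _, label = min(TRAINING, key=lambda row: math.dist(row[0], x))
    return label


def nearest_centroid_predict(x):
    """Model B: label of the closest class centroid (per-class mean)."""
    by_label = {}
    for features, label in TRAINING:
        by_label.setdefault(label, []).append(features)
    centroids = {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in by_label.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


if __name__ == "__main__":
    for query in [(1.5, 1.5), (5.8, 5.8)]:
        a = nearest_neighbor_predict(query)
        b = nearest_centroid_predict(query)
        print(query, a, b, "agree" if a == b else "DISAGREE")
```

Both models are defensible, yet for the query `(5.8, 5.8)` they return different labels, so neither output can simply be declared the expected value.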

Thoughts on Making ML Models Testable

Given that ML models are non-testable due to the absence of a test oracle, let's look at some techniques (pseudo-oracles) that could be used to perform quality control checks on Machine Learning models. This is by no means an exhaustive list. I will be posting research findings in later posts in the coming weeks/months.

  • Dual-coding technique for quality control checks of Machine Learning models: Build multiple models using different algorithms. In the example above, models using Random Forest, Stochastic Gradient Descent, and Support Vector Classification (SVC) algorithms were built to predict the quality of the wine. Say that, based on performance, the random forest classifier is accepted as the final model to be moved to production. In the QA environment, another model with the second-best accuracy, say the Support Vector Classifier, is run alongside it. If the predictions made by the two models differ, an alert is raised for QA/data scientists to validate the result.
  • Compare the ML model outcome with that of a simplified linear model: Build a simplified (less complex) linear model and compare the predictions of the actual model with those of the simplified model.
  • Metamorphic testing technique for quality control checks of ML models: If a relationship between input and output variables is known, the model can be fed a known set of related inputs and the outputs can be checked against that relationship, even though the exact expected values are unknown. I will go into details in one of my posts in the near future.
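
The dual-coding and metamorphic techniques above can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: `primary_model` and `shadow_model` are hypothetical rule-based stand-ins for the production model and the second-best model, the sample values are made up, and the metamorphic relation (raising alcohol content should never turn a "good" prediction into "bad") is an assumed domain relation, not one stated in the original post.

```python
def primary_model(sample):
    """Hypothetical production model: weighted score with a threshold."""
    score = 0.8 * sample["alcohol"] - 2.0 * sample["volatile_acidity"]
    return "good" if score >= 7.0 else "bad"


def shadow_model(sample):
    """Hypothetical second-best model kept in QA as a pseudo-oracle."""
    return "good" if sample["alcohol"] >= 10.5 else "bad"


def dual_coding_alerts(samples):
    """Dual coding: return the samples on which the two models disagree."""
    return [s for s in samples if primary_model(s) != shadow_model(s)]


def violates_monotonic_relation(model, sample, step=1.0):
    """Metamorphic check under an assumed relation: raising alcohol while
    holding the other features fixed must not turn 'good' into 'bad'."""
    follow_up = dict(sample, alcohol=sample["alcohol"] + step)
    return model(sample) == "good" and model(follow_up) == "bad"


# Made-up inputs for illustration only.
SAMPLES = [
    {"alcohol": 12.0, "volatile_acidity": 0.3},
    {"alcohol": 9.0, "volatile_acidity": 0.6},
    {"alcohol": 11.0, "volatile_acidity": 1.2},
]

if __name__ == "__main__":
    for sample in dual_coding_alerts(SAMPLES):
        print("ALERT: models disagree, needs expert review:", sample)
    for sample in SAMPLES:
        if violates_monotonic_relation(primary_model, sample):
            print("ALERT: metamorphic relation violated:", sample)
```

Neither check needs the true label of a sample. A disagreement or a violated relation does not prove the production model is wrong; it only flags cases that QA or data scientists should review.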


In this post, you learned why Machine Learning systems/models are considered non-testable. If you are part of a QA team or are a data scientist and you cannot find specialized QA practices for performing quality control checks on Machine Learning models, reach out to the stakeholders in your company and get started on this. Please feel free to reach out to me and to leave your thoughts in the comments section.

machine learning, software testing, qa, artificial intelligence, quality control checks

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
