How to Test AI Models: An Introductory Guide for QA

Get some simple answers to frequently asked questions regarding quality assurance within machine learning.

By Anastasiya Kazeeva · Mar. 16, 18 · Opinion · 18.7K Views


Machine learning is one of the most rapidly developing fields in computer science. Unfortunately, customers who are neither data scientists nor ML developers often don't know how to handle it; they only know that they need to incorporate AI into their products.

Here are the most frequently asked questions we get from customers regarding quality assurance for ML:

  1. I want to run UAT; could you please provide your full regression test cases for AI?
  2. OK, I have the model running in production; how can we assure it doesn’t break when we update it?
  3. How can I make sure it produces the right values I need?

What are some simple answers here?

A Brief Introduction to Machine Learning

To understand how ML works, let's take a closer look at what an ML model actually is.

What is the difference between classical algorithms/hardcoded functions and ML-based models?

From the black-box perspective, it’s the same box with input and output. Fill the inputs in, get the outputs — what a beautiful thing!

From the white-box perspective, specifically in how the system is built, it's a bit different. The core difference is:

  1. You write the function, or
  2. The function is fitted by a specific algorithm based on your data.

You can verify the ETL part and inspect the model coefficients, but you can't verify model quality as easily as you can verify other parameters.
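
To make the difference concrete, here is a toy, hypothetical example using scikit-learn (not from the original article). The first function is written by hand and can be reviewed by reading the code; the second is fitted from data, so its behavior depends on the training set rather than on hand-written rules.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hardcoded function: a developer wrote the logic, so you can review it directly.
def predict_price_hardcoded(square_meters):
    return 50_000 + 3_000 * square_meters

# Fitted function: the "logic" (coefficients) is learned from data by an algorithm.
X = np.array([[30], [45], [60], [80]])               # apartment size in square meters
y = np.array([140_000, 190_000, 230_000, 290_000])   # observed prices
model = LinearRegression().fit(X, y)

print(predict_price_hardcoded(50))   # deterministic, follows directly from the written rule
print(model.predict([[50]]))         # depends entirely on the training data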

So, What About QA?

The model review procedure is similar to code review but tailored to data science teams. I haven't seen many QA engineers participate in this particular procedure; what follows it is model quality assessment, improvement, and so on, and that assessment usually happens inside the data science team.

Traditional QA comes into play for integration cases. Here are five points indicating that you have reasonable quality assurance when dealing with machine learning models in production:

  1. You have a service based on ML functions deployed in production. It's up and running, and you want to make sure it isn't broken by an automatically deployed new version of the model. This is a pure black-box scenario: load the test dataset and verify that the output is acceptable (for example, compare it to the predeployment-stage results). Keep in mind: it's not about exact matching; it's about the best-suggested value, so you need to agree on an acceptable dispersion rate (a sketch follows this list).
  2. You want to verify that the deployed ML functions process the data correctly (for example, that there is no +/- sign inversion). That's where the white-box approach works best: use unit and integration tests to check that input data is loaded into the model correctly, that signs (+/-) come out right, and that feature output is as expected. Wherever you use ETL, it's good to have white-box checks (see the second sketch after this list).
  3. Production data can mutate: over time, the same input produces a new expected output. For example, something changes user behavior and the quality of the model falls. The other case is dynamically changing data. If that risk is high, here are two approaches:
    1. Simple but expensive approach: Retrain daily on the new dataset. In this case, you need to find the right balance for your service, since retraining cost is tied directly to your infrastructure cost.
    2. Complex approach: It depends on how you collect feedback. For binary classification, for example, you can calculate precision, recall, and F1 score, and write a service that dynamically scores the model on these parameters. If the score falls below 0.6, it's an alert; if it falls below 0.5, it's a critical incident (see the third sketch after this list).
  4. Public beta tests work very well for certain cases. You assess your model quality on data that wasn't used previously. For instance, add 300 more users to generate data and process it. Ideally, the more new data you test on, the better: the original dataset is good, but a larger amount of high-quality data is always better. Note: Extrapolating test data is not a good fit here; your model should work well with real users, not with predicted or generated data.
  5. Automatically ping the service to make sure it's alive (not specifically ML testing, but it shouldn't be forgotten). Use Pingdom. Yeah, this simple thing has saved us many times. There are plenty of more advanced DevOps solutions out there; however, for us, everything started with this one, and we benefited a lot from it.
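
Here is a minimal sketch of the black-box check from point 1. The tolerance value and the predict_batch callable are assumptions for illustration; agree on the real dispersion rate with your data scientist.

import numpy as np

ACCEPTABLE_DISPERSION = 0.05  # assumed tolerance, not a universal value

def black_box_regression_check(predict_batch, test_inputs, baseline_outputs):
    """Compare the newly deployed model's outputs against stored predeployment outputs."""
    new_outputs = np.asarray(predict_batch(test_inputs), dtype=float)
    baseline = np.asarray(baseline_outputs, dtype=float)
    # Not exact matching: we only require the mean relative deviation to stay within tolerance.
    deviation = np.mean(np.abs(new_outputs - baseline) / (np.abs(baseline) + 1e-9))
    assert deviation <= ACCEPTABLE_DISPERSION, (
        f"Model output drifted by {deviation:.2%}, above the acceptable {ACCEPTABLE_DISPERSION:.0%}"
    )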
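
For point 2, the white-box checks are ordinary unit tests over your ETL and feature code. The build_features function here is a hypothetical stand-in for your real transformation:

# Hypothetical ETL step: scales a raw balance into a model feature.
def build_features(raw_record):
    return {"balance_scaled": raw_record["balance"] / 1000.0}

def test_balance_sign_is_preserved():
    # The transformation must not invert the sign (+/-) of the input value.
    assert build_features({"balance": -120.5})["balance_scaled"] < 0
    assert build_features({"balance": 980.0})["balance_scaled"] > 0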
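
And for point 3, a sketch of the dynamic scoring idea for a binary classifier. The 0.6 and 0.5 thresholds come from the text above; how you collect the ground-truth feedback depends on your product.

from sklearn.metrics import precision_score, recall_score, f1_score

ALERT_THRESHOLD = 0.6      # below this: raise an alert
CRITICAL_THRESHOLD = 0.5   # below this: critical incident

def score_recent_predictions(y_true, y_pred):
    """Score the model on collected feedback and decide whether to alert."""
    metrics = {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    if metrics["f1"] < CRITICAL_THRESHOLD:
        metrics["status"] = "critical"
    elif metrics["f1"] < ALERT_THRESHOLD:
        metrics["status"] = "alert"
    else:
        metrics["status"] = "ok"
    return metrics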

Answers

The points above cover pretty much everything concerning QA participation. Now, let's answer the customers' questions posed at the beginning of this article.

  1. I want to run UAT; could you please provide your full regression test cases for AI?
    1. Describe the black box to the customer, and provide them with test data and a service that can process and visualize the output.
    2. Describe all the testing layers: whether you verify data and model features at the ETL layer, and how you do it.
    3. Produce a model quality report. Provide the customer with model quality metrics vs. standard values. Get these from your data scientist.
  2. OK, I have the model running in production; how can we assure it doesn’t break when we update it?
    1. You need a QA review of any production push, just as you would for any other software.
      1. Perform a black-box smoke test: try various types of inputs based on the function (a small sketch follows this list).
      2. Verify model metrics on the production server with a sample of test data. If needed, isolate part of the production server so users aren't affected by the test.
      3. Of course, make sure your white box tests are passing.
  3. How can I make sure it produces the right values I need?
    1. You should always be aware of the acceptable standard deviation for your model and data. Spend some time with your data scientist and dig deeper into the model type and the technical aspects of the algorithms.
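
As a small illustration of the smoke test in answer 2 (the endpoint URL and payload shape are hypothetical), try several classes of input and check only that the service responds sensibly, not that the prediction itself is "right":

import requests

PREDICT_URL = "https://example.com/api/predict"   # hypothetical endpoint; replace with your own

SMOKE_INPUTS = [
    {"features": [1.0, 2.0, 3.0]},    # typical input
    {"features": [0.0, 0.0, 0.0]},    # boundary values
    {"features": [-1.0, 1e6, 0.5]},   # extreme values
    {"features": []},                 # malformed input: should be rejected gracefully, not crash
]

def smoke_test():
    for payload in SMOKE_INPUTS:
        response = requests.post(PREDICT_URL, json=payload, timeout=5)
        # The service should stay up and answer with success or a clean validation error.
        assert response.status_code in (200, 400), (
            f"Unexpected status {response.status_code} for payload {payload}"
        )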

Any other questions you have in mind? Let’s try to figure them out and get the answers!


