Best Practices for Setting up Monitoring Operations for Your AI Team

In this post, we'll explore key tips to help you set up a robust monitoring operation that proactively addresses issues before they negatively impact your business KPIs.

By Itai Bar-Sinai · Mar. 24, 23 · Opinion


In recent years, MLOps has become a buzzword in the world of AI, usually discussed in the context of tools and technology. But while much attention is paid to the technical side of MLOps, the operations side is frequently overlooked: there is little discussion of the operations needed to run machine learning (ML) in production, and monitoring in particular. Accountability for AI performance, timely alerts for the relevant stakeholders, and the processes needed to resolve issues are routinely set aside in favor of debates about specific tools and tech stacks.

ML teams have traditionally been research-oriented, focusing heavily on training models to achieve high test scores. Once a model is deployed in real business processes and applications, however, the culture around production-oriented operations is often missing. As a consequence, it is unclear who is responsible for the model's outcomes and performance. Without the right operations in place, even the most advanced tools and technology won't be enough to ensure healthy governance of your AI-driven processes.

1. Cultivate a Culture of Accountability

As previously stated, data science and ML teams have traditionally been research-oriented, measured on model evaluation scores rather than on real-world, business-related outcomes. In such an environment, monitoring will never be done well because, frankly, no one cares enough. To fix this, the team that builds AI models must take ownership of, and feel accountable for, each model's success or failure in serving the business function it was designed for.

The best way to achieve this is to measure individual and team performance against production-oriented KPIs, and to create an environment that fosters a sense of ownership over the model's overall performance rather than just its results in controlled testing environments.

While some team members may remain focused on research, it's important to recognize that achieving good test scores in experiments is not sufficient to ensure the model's success in production. The ultimate success of the model lies in its effectiveness in real-world business processes and applications.

2. Make a "Monitoring Plan" Part of Your Release Checklist 

To ensure the ongoing success of an AI-driven application, planning how it is going to be monitored is a critical factor that should not be overlooked. 

In healthy engineering organizations, there is always a release checklist that entails setting up a monitoring plan whenever a new component is released. AI teams should follow that pattern. The person or team responsible for building a model must have a clear understanding of how it fits into the overall system and should be able to predict potential issues that could arise, as well as identify who needs to be alerted and what actions should be taken in the event of an issue.

While some potential issues may be more research-oriented, such as data or concept drift, there are many other factors to consider, such as a broken feature pipeline or a third-party data provider changing input formats. Therefore, it is important to anticipate as many of these issues as possible and set up a plan to effectively deal with them should they arise.
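
To make this concrete, below is a minimal sketch, in Python, of two checks such a plan might include: a Population Stability Index (PSI) test for data drift and a schema check that catches a third-party provider changing its input format. The feature names and the 0.2 threshold are illustrative assumptions, not part of the original article, and most monitoring tools ship equivalents out of the box.

```python
import numpy as np

DRIFT_THRESHOLD = 0.2  # common rule of thumb for PSI; tune per feature

def population_stability_index(baseline, production, bins=10):
    """Quantify drift of a production feature distribution vs. a training baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # cover out-of-range production values
    edges = np.unique(edges)                # guard against duplicate quantile edges
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    prod_pct = np.histogram(production, edges)[0] / len(production)
    base_pct = np.clip(base_pct, 1e-6, None)  # avoid log(0)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - base_pct) * np.log(prod_pct / base_pct)))

EXPECTED_COLUMNS = {"user_id", "age", "avg_session_minutes"}  # hypothetical schema

def check_input_schema(batch_columns):
    """Catch an upstream data provider silently changing its format."""
    missing = EXPECTED_COLUMNS - set(batch_columns)
    return {"status": "alert" if missing else "ok", "missing": sorted(missing)}
```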

Although it's very likely that there are potential issues that will remain unforeseen, it's still better to do something rather than nothing, and typically, the first 80% of issues can be anticipated with 20% of the work. 

3. Establish an On-Call Rotation 

Sharing the responsibility among team members may be necessary or helpful, depending on the size of your team and the number of models or systems under your control. By setting up an "on-call" rotation, everyone can have peace of mind knowing that there is at least one knowledgeable person available to handle any issues the moment they arise.

It's important to note that taking care of an issue doesn't necessarily mean solving the problem immediately. Sometimes, it might mean triaging and deferring it to a later time or waking up the person who is best equipped to solve the problem. Sharing an on-call rotation with pre-existing engineering teams can also be an option in some instances. However, this is use-case dependent and may not be possible for every team.

Regardless of the approach, it is imperative to establish a shared knowledge base that the person on-call can utilize so that your team can be well-prepared to take care of emerging issues.
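
At its simplest, a rotation can be a deterministic round-robin over the team, so everyone knows weeks in advance when they are on duty. Here is a minimal sketch with hypothetical names; dedicated scheduling tools such as PagerDuty or Opsgenie handle this, and much more, for real teams.

```python
from datetime import date

TEAM = ["dana", "omer", "noa"]  # hypothetical on-call roster

def on_call(day: date) -> str:
    """One owner per ISO week, rotating round-robin through the team."""
    return TEAM[day.isocalendar()[1] % len(TEAM)]

print(on_call(date.today()))  # e.g. "omer"
```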

4. Set up a Shared Knowledge Base

To maintain healthy monitoring operations, it is essential to have accessible resources that detail how your system works and its main components. This is where wikis and playbooks come in. Wikis can provide a central location for documentation on your system, including its architecture, data sources, and model dependencies. Playbooks can be used to document specific procedures for handling common issues or incidents that may arise.

Having these resources in place can help facilitate knowledge sharing and ensure that everyone on the team is equipped to troubleshoot and resolve issues quickly. It also allows for smoother onboarding of new team members who can quickly get up to speed on the system. In addition, having well-documented procedures and protocols can help reduce downtime and improve response times when issues transpire. 
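
One lightweight way to keep playbooks actionable is to store each one as a structured entry keyed by the alert that references it, so the on-call person lands directly on the relevant steps. A minimal sketch, with hypothetical content:

```python
# A minimal, hypothetical playbook registry: each alert links to steps
# the on-call person can follow without paging the model's author.
PLAYBOOKS = {
    "feature_pipeline_broken": {
        "symptoms": "Null rate spikes on model inputs; scores skew low.",
        "first_steps": [
            "Check the feature job's last successful run in the scheduler.",
            "Compare today's input row count against the 7-day average.",
        ],
        "escalate_to": "data-platform on-call",
    },
    "data_drift": {
        "symptoms": "PSI above 0.2 on one or more features.",
        "first_steps": [
            "Identify which features drifted and since when.",
            "Check for recent upstream schema or provider changes.",
        ],
        "escalate_to": "model owner",
    },
}

def playbook_for(alert_name: str) -> dict:
    return PLAYBOOKS.get(alert_name, {"first_steps": ["Triage manually."]})
```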

5. Implement Post-Mortems

Monitoring is an iterative process; it is impossible to predict everything that might go wrong in advance. But when an issue does occur and goes undetected or unresolved for too long, it is important to conduct a thorough analysis and identify the root cause. Once the root cause is understood, the monitoring plan can be amended and improved accordingly.

Post-mortems also help build a culture of accountability, which, as discussed earlier, is the key factor in successful monitoring operations.

6. Get the Right Tools for Effective Monitoring

Once you have established the need for healthy monitoring operations and addressed the cultural considerations, the next critical step is to equip your team with the right tools, so they can actually be accountable for each model's performance in the business function it serves.

This means implementing tools that raise timely alerts for issues (hard to do well, since issues typically start small and hidden), along with capabilities for root cause analysis and troubleshooting. Integrations with your existing tools, such as ticketing systems, as well as issue tracking and management capabilities, are also essential for seamless coordination and collaboration among team members. Investing in the right tools will empower your team to take full ownership and accountability, ultimately leading to better outcomes for the business.
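
As a sketch of what "timely alerts plus ticketing integration" can look like in code, the snippet below checks a metric against a threshold and opens a ticket through a generic webhook. The endpoint, payload fields, and metric name are hypothetical; in practice you would call your monitoring vendor's and ticketing system's actual APIs.

```python
import requests  # widely used HTTP client

TICKETING_WEBHOOK = "https://tickets.example.com/api/issues"  # hypothetical endpoint

def check_and_alert(metric_name: str, value: float, threshold: float) -> None:
    """Open a ticket when a production metric crosses its threshold."""
    if value <= threshold:
        return
    requests.post(
        TICKETING_WEBHOOK,
        json={
            "title": f"[model-monitoring] {metric_name} breached threshold",
            "body": f"{metric_name}={value:.3f} exceeded {threshold:.3f}.",
            "assignee": "ml-on-call",  # hypothetical routing
        },
        timeout=10,
    )

# Example: alert if the PSI computed earlier indicates drift.
check_and_alert("psi_avg_session_minutes", 0.27, 0.2)
```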

Conclusion 

Following these guidelines will set your AI team up for successful production-oriented operations. Monitoring is a crucial aspect of MLOps, involving accountability, timely alerts, troubleshooting, and much more, and taking the time to establish healthy monitoring practices pays off in continuous improvement.


Published at DZone with permission of Itai Bar-Sinai. See the original article here.

Opinions expressed by DZone contributors are their own.
