
OpenAI Evals Using Phoenix

OpenAI Evals are used for evaluating LLM models and measuring their accuracy, which helps you compare custom models and figure out how well your custom model performs.

By Somanath Balakrishnan · Apr. 02, 24 · Tutorial


OpenAI Evals are used for evaluating LLM models and measuring their accuracy. They help you compare your custom model with existing models and figure out how well it performs, so you can make the necessary modifications and refinements.

If you are new to OpenAI Evals, I recommend going through the OpenAI Evals repo to get a taste of what evals actually look like. It's like what Greg said here:

Greg's tweet

Role of Evals

From the OpenAI Evals repo: Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs.

Can you imagine writing an evaluation program for a complex model by yourself? You may spend hours creating an LLM model and not have the bandwidth to build an evaluation program, since that can take more effort than building the model itself. That is where the Evals framework helps you test LLM models and verify their accuracy. You can use GPT-3.x or GPT-4 as the evaluating model, depending on your needs and what your own model is targeting.
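To make this concrete, here is a minimal hand-rolled check, not from this article and with made-up question/answer pairs, that shows the kind of evaluation plumbing a framework like Evals saves you from writing and maintaining yourself:

Python
 
import openai

# Reads OPENAI_API_KEY from the environment.
client = openai.OpenAI()

# A tiny, hand-labeled set (made up purely for illustration).
eval_set = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 7 * 6?", "expected": "42"},
]

correct = 0
for item in eval_set:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": item["question"]}],
        temperature=0.0,
    )
    answer = response.choices[0].message.content
    # Naive substring check; real evals need much more robust grading.
    if item["expected"].lower() in answer.lower():
        correct += 1

print(f"Accuracy: {correct / len(eval_set):.2f}")

As the labeled set grows and the grading logic gets fancier (fuzzy matching, model-graded answers, rails), this quickly becomes a project of its own, which is exactly the gap the Evals framework fills.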

Building Evals

The OpenAI Evals repo has a good intro and detailed steps for creating a custom eval for an arithmetic model (a rough sketch of the dataset format follows the list below):

  • Intro
  • Building an eval (using available options)
  • Building a custom eval
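For reference, the datasets behind custom evals in the OpenAI Evals repo are JSONL files of samples with chat-style "input" messages and an "ideal" answer. As a rough sketch (the samples and file name below are made up for illustration), you could generate a couple of arithmetic samples like this:

Python
 
import json

# Hypothetical arithmetic samples in the JSONL format used by the OpenAI Evals repo:
# each line holds chat-style "input" messages and the "ideal" (expected) answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "You are a calculator."},
            {"role": "user", "content": "What is 48 + 2?"},
        ],
        "ideal": "50",
    },
    {
        "input": [
            {"role": "system", "content": "You are a calculator."},
            {"role": "user", "content": "What is 5 * 20?"},
        ],
        "ideal": "100",
    },
]

with open("arithmetic_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")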

The links above pretty much cover everything you need to run built-in or custom evals. My goal here is to help you use the Phoenix framework, which I find a bit easier than using OpenAI Evals directly. Phoenix is actually built on top of the OpenAI Evals framework.

  • Phoenix Home
  • Repo
  • LLM evals documentation
  • How to
  • This explanation of LLM evals will help you understand them better.

Building Custom Evals

Building your own evals is the go-to way to compare your custom model against GPT-3.5 or GPT-4, so here are the steps for that.

Below are the steps that I followed and tested to evaluate my models:

Install Phoenix and related modules: 

Python
 
!pip install -qq "arize-phoenix-evals" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken


Make sure you have all imports covered.

Python
 
import os
from getpass import getpass

import matplotlib.pyplot as plt
import openai
import pandas as pd

# import phoenix libs
import phoenix.evals.templates.default_templates as templates
from phoenix.evals import (
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report


Prepare your data (or download one of the benchmark datasets). For example:

Python
 
df = download_benchmark_dataset(task="qa-classification", dataset_name="qa_generated_dataset")


Set your OpenAI key.

Python
 
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

Prepare the dataset in the correct format expected by the evaluation prompt.

Python
 
# N_EVAL_SAMPLE_SIZE is the number of rows to sample for the eval (set it to suit your run).
df_sample = (
    df.sample(n=N_EVAL_SAMPLE_SIZE)
    .reset_index(drop=True)
    .rename(
        columns={
            "question": "input",
            "context": "reference",
            "sampled_answer": "output",
        }
    )
)


Set and load the model for running evals.

Python
 
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)


Run your custom evals.

Python
 
rails = list(templates.QA_PROMPT_RAILS_MAP.values())
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()


Evaluate the above predictions with pre-defined labels.

Python
 
true_labels = df_sample["answer_true"].map(templates.QA_PROMPT_RAILS_MAP).tolist()
print(classification_report(true_labels, Q_and_A_classifications, labels=rails))


Create a confusion matrix and plot it to get a better picture.

Python
 
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)


Note: You can set the model to "gpt-3.5-turbo" in the model setup step above and run the evals against GPT-3.5, or against any other model you want to compare your custom model with.
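For example, assuming the same df_sample, rails, templates, and true_labels from the steps above, re-running the eval against GPT-3.5 only requires swapping the model (a sketch):

Python
 
# Same eval, different model (sketch reusing the objects defined earlier).
model_gpt35 = OpenAIModel(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
)

gpt35_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.QA_PROMPT_TEMPLATE,
    model=model_gpt35,
    rails=rails,
    concurrency=20,
)["label"].tolist()

print(classification_report(true_labels, gpt35_classifications, labels=rails))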

PS: The code and steps I have mentioned in this article are based on a Google Colab notebook I followed, which has good step-by-step instructions.

Here is a good article by Aparna Dhinakaran (co-founder and CPO of Arize AI and a Phoenix contributor) about Evals and Phoenix.

Conclusion

I hope this article helped you understand how evals can be implemented for custom models. I would be happy if you got at least some insight into evals and some motivation to create your own. All the best with your experiments!


