Application Task Driven: LLM Evaluation Metrics in Detail
A list of common LLM system evaluation metrics organized by application task, along with supporting open-source frameworks.
In the dynamic landscape of Natural Language Processing (NLP), the evaluation of language model performance is a pivotal aspect of gauging efficacy across downstream applications. Different applications demand distinct performance indicators aligned with their goals. In this article, we'll take a detailed look at various LLM evaluation metrics and explore how they apply to real-world scenarios. From traditional summarization tasks to more nuanced contextual evaluations, we navigate the evolving methodologies used to assess the proficiency of language models, shedding light on their strengths, limitations, and practical implications for NLP research and applications. Below are some common text application tasks and the corresponding evaluation metrics/frameworks.
1. Text Summarization
Text summarization is a natural language processing (NLP) task aimed at reducing/distilling the content of a given text document into a shorter version while retaining the most important information and the overall meaning of the original text. Text summarization can be performed using extractive or abstractive techniques. Some of the metrics/frameworks for evaluating such a system can be:
- SUPERT: Unsupervised Multi-Document Summarization Evaluation and Generation. It rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary (selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques).
- BLANC: It measures the functional performance of a summary with an objective, reproducible, and fully automated method. It achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text.
- FactCC: It uses a weakly supervised, model-based approach to verify factual consistency and identify conflicts between source documents and a generated summary (a usage sketch follows the SUPERT example below).
>>> from blanc import BlancHelp, BlancTune
>>> document = "Jack drove his minivan to the bazaar to purchase milk and honey for his large family."
>>> summary = "Jack bought milk and honey."
>>> blanc_help = BlancHelp()
>>> blanc_tune = BlancTune(finetune_mask_evenly=False, show_progress_bar=False)
>>> blanc_help.eval_once(document, summary)
0.2222222222222222
>>> blanc_tune.eval_once(document, summary)
0.3333333333333333
Sample code for basic usage of BLANC metric.
from ref_free_metrics.supert import Supert
from utils.data_reader import CorpusReader
# read docs and summaries
reader = CorpusReader('data/topic_1')
source_docs = reader()
summaries = reader.readSummaries()
# compute the Supert scores
supert = Supert(source_docs)
scores = supert(summaries)
Sample code for basic usage of SUPERT metric.
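FactCC is released as a BERT-based checkpoint rather than a pip-installable metric, so the snippet below is only a minimal sketch of how such a checkpoint could be loaded through Hugging Face transformers; the checkpoint path is a placeholder, not an official model ID.
from transformers import BertForSequenceClassification, BertTokenizer
# Placeholder path: point this at the FactCC weights released by the authors
model_name = "path/to/factcc-checkpoint"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
document = "Jack drove his minivan to the bazaar to purchase milk and honey for his large family."
claim = "Jack bought milk and honey."
# FactCC classifies a (document, claim) pair as consistent or inconsistent
inputs = tokenizer(document, claim, truncation=True, return_tensors="pt")
label_id = model(**inputs).logits.argmax(dim=-1).item()
print(model.config.id2label[label_id])
Hypothetical sketch of FactCC-style consistency checking (placeholder checkpoint path).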
2. Overlap Text Similarity
Overlap-based text similarity measures quantify the similarity between two pieces of text by assessing the presence and frequency of shared words, phrases, or n-grams. These are straightforward and computationally efficient but may not capture semantic similarity accurately, especially when dealing with text containing synonyms, paraphrases, or different word forms. Some of the metrics/frameworks for evaluating such a system can be:
- BLEU (Bilingual Evaluation Understudy): It is a widely used precision-based measure for evaluating the quality of machine-translated text by comparing it to human translations. BLEU scores individual translated segments against reference translations and averages them to estimate the overall quality, focusing on correspondence rather than intelligibility or grammatical correctness.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): It focuses on evaluating the quality of summaries or generated text by comparing them to one or more reference texts. It measures the overlap of n-grams (contiguous sequences of n items, usually words) between the generated text and the reference texts. ROUGE includes multiple variants such as ROUGE-N (which considers n-gram overlap), ROUGE-L (which measures the longest common subsequence between the generated and reference texts), and ROUGE-W (which considers weighted overlaps).
- METEOR (Metric for Evaluation of Translation with Explicit Ordering): It is another widely used evaluation metric in the field of machine translation. Unlike ROUGE and BLEU, which primarily focus on n-gram overlap, METEOR incorporates additional linguistic features such as stemming, synonymy, and word order to assess the quality of translated text. It computes a harmonic mean of precision and recall, giving equal weight to both. It also includes penalties for word order differences and unaligned words to encourage translations that preserve the order and content of the reference translation (a sample follows the ROUGE example below).
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [
... ["hello there general kenobi", "hello there !"],
... ["foo bar foobar"]
... ]
>>> bleu = evaluate.load("bleu")
>>> results = bleu.compute(predictions=predictions, references=references)
>>> print(results)
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
Sample code for basic usage of BLEU metric from Huggingface.
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
... use_aggregator=False)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
[0.5, 0.0]
Sample code for basic usage of ROUGE metric from Huggingface.
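METEOR is also available through the same Hugging Face evaluate library; the snippet below is a minimal sketch (the metric downloads its nltk resources on first use).
>>> import evaluate
>>> meteor = evaluate.load('meteor')
>>> predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
>>> references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
>>> results = meteor.compute(predictions=predictions, references=references)
>>> print(round(results['meteor'], 2))  # a score in the [0, 1] range
Sample code for basic usage of METEOR metric from Huggingface.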
3. Semantic Text Similarity
Semantic text similarity captures the underlying semantics or the meaning of two pieces of text rather than just their structural overlap. Using natural language processing (NLP) and machine learning techniques, semantic text similarity methods represent words, phrases, or entire text passages as dense, continuous vectors in a high-dimensional semantic space. Some of the metrics/frameworks for evaluating such a system can be:
- BERTScore: It leverages pre-trained BERT (Bidirectional Encoder Representations from Transformers) models to compute similarity scores between sentences or text passages. It calculates the similarity based on contextual embeddings obtained from BERT, which captures semantic information by considering the surrounding context of each word, providing a more nuanced evaluation of the language generation task. It has been shown to correlate well with human judgments of text quality. Using the appropriate BERT model becomes crucial, as it impacts storage space and the accuracy of the score.
- MoverScore: It measures the semantic similarity between two text passages by computing the minimal cost of transforming one passage into another using an optimal transport algorithm. It is based on distributional semantics and focuses on aligning the distribution of words between passages. By considering both the content and the structure of the text, MoverScore provides a robust measure of semantic similarity that is less sensitive to surface-level differences such as word order or vocabulary choice (a usage sketch follows the BERTScore example below).
from evaluate import load
bertscore = load("bertscore")
predictions = ["hello world", "general kenobi"]
references = ["hello world", "general kenobi"]
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")
print(results)
{'precision': [1.0, 1.0], 'recall': [1.0, 1.0], 'f1': [1.0, 1.0], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.10(hug_trans=4.10.3)'}
Sample code for basic usage of BERTScore metric from Huggingface.
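MoverScore can be computed with the authors' moverscore_v2 package; the call below follows the repository's README and is a minimal sketch, assuming the package and its default pretrained model are installed.
from moverscore_v2 import get_idf_dict, word_mover_score
references = ["hello world", "general kenobi"]
hypotheses = ["hi world", "general kenobi"]
# IDF dictionaries computed over the reference and hypothesis corpora
idf_dict_ref = get_idf_dict(references)
idf_dict_hyp = get_idf_dict(hypotheses)
scores = word_mover_score(references, hypotheses, idf_dict_ref, idf_dict_hyp,
                          stop_words=[], n_gram=1, remove_subwords=True)
print(scores)  # one score per reference/hypothesis pair
Sample code for basic usage of MoverScore (following the moverscore_v2 README).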
4. RAG (Retrieval-Augmented Generation)
RAG is an innovative approach to natural language processing that combines the strengths of both retrieval-based and generation-based models. In RAG, a large-scale pre-trained retriever model is employed to retrieve relevant context or passages from a knowledge source, such as a large text corpus or a knowledge graph. These retrieved passages are then used as input or guidance for a generation model, such as a language model or a transformer, to produce coherent and contextually relevant text outputs. Some of the metrics/frameworks for evaluating such a system can be:
- RAGAs: Ragas aims to create an open standard, providing developers with the tools and techniques to leverage continual learning in their RAG applications. It lets you synthetically generate a diverse test dataset to evaluate your application and provides LLM-assisted evaluation metrics to objectively measure its performance. Ragas offers metrics tailored to evaluating each component of your RAG pipeline in isolation (e.g., Faithfulness and Answer Relevancy for generation; Context Precision and Context Recall for retrieval).
- Faithfulness: This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0, 1) range; higher is better.
- Answer relevancy: This metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, while higher scores indicate better relevancy.
- Context recall: Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
- Context precision: Context precision evaluates whether all of the ground-truth-relevant items present in the contexts are ranked highly. Ideally, all the relevant chunks should appear at the top ranks.
- Context relevancy: This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.
- Context entity recall: It is a measure of what fraction of entities are recalled from ground_truths. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc.
- Answer semantic similarity: The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth.
- Answer correctness: The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth.
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. It combines synthetic data generation with fine-tuned classifiers to efficiently assess context relevance, answer faithfulness, and answer relevance, minimizing the need for extensive human annotations. ARES employs synthetic query generation and prediction-powered inference (PPI), providing accurate evaluations with statistical confidence.
from datasets import Dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
os.environ["OPENAI_API_KEY"] = "your-openai-key"
data_samples = {
'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
Example of computing faithfulness and answer correctness with Ragas.
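The same evaluate call also accepts Ragas' retrieval-side metrics. Below is a minimal sketch reusing the dataset defined above, assuming the context_precision and context_recall metric objects exported by ragas.metrics (they rely on the question, contexts, and ground_truth columns).
from ragas import evaluate
from ragas.metrics import context_precision, context_recall
# Evaluate the retrieval component of the pipeline in isolation
retrieval_scores = evaluate(dataset, metrics=[context_precision, context_recall])
print(retrieval_scores.to_pandas())
Example of evaluating retrieval metrics with Ragas.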
from ares import ARES
ues_idp_config = {
"in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
Example of retrieving the UES/IDP scores with GPT-3.5 using ARES.
5. QA (Question-Answering)
A task that involves designing algorithms and models to automatically generate answers to questions posed in natural language. The task typically involves processing a question, understanding its semantics, and then searching through a given context or knowledge base to find relevant information that can directly answer the question. Its complexity may range from simple fact-based questions to more complex scenarios requiring reasoning and inference. Some of the metrics/frameworks for evaluating such a system can be:
- QAEval: QAEval is a question-answering-based metric for estimating the content quality of a summary. It generates QA pairs from reference summaries and then uses a QA model to answer those questions against a candidate summary. The final score is the proportion of questions that were answered correctly.
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization. It builds on QAEval with question consistency filtering and an improved answer overlap metric, leading to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark.
- QuestEval: It is an NLG metric for assessing whether two different inputs contain the same information. The metric, based on question generation and answering, can handle multimodal and multilingual inputs. In contrast to established metrics such as ROUGE or BERTScore, QuestEval does not require any ground-truth reference (a usage sketch follows the QAFactEval example below).
from qafacteval import QAFactEval
kwargs = {"cuda_device": 0, "use_lerc_quip": True, \
"verbose": True, "generation_batch_size": 32, \
"answering_batch_size": 32, "lerc_batch_size": 8}
model_folder = "" # path to models downloaded with download_models.sh
metric = QAFactEval(
lerc_quip_path=f"{model_folder}/quip-512-mocha",
generation_model_path=f"{model_folder}/generation/model.tar.gz",
answering_model_dir=f"{model_folder}/answering",
lerc_model_path=f"{model_folder}/lerc/model.tar.gz",
lerc_pretrained_model_path=f"{model_folder}/lerc/pretraining.tar.gz",
**kwargs
)
results = metric.score_batch_qafacteval(["This is a source document"], [["This is a summary."]], return_qa_pairs=True)
score = results[0][0]['qa-eval']['lerc_quip']
Sample code for basic usage of QAFactEval.
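QuestEval ships as its own Python package; the snippet below follows the repository's README and is a minimal sketch, assuming the questeval package and its default models are installed.
from questeval.questeval_metric import QuestEval
questeval = QuestEval(no_cuda=True)
source = "Since 2000, the recipient of the Kate Greenaway medal has also been presented with the Colin Mears award to the value of 35000."
prediction = "Since 2000, the winner of the Kate Greenaway medal has also been given the Colin Mears award of 35000 pounds."
# Reference-free scoring: only the source and the hypothesis are needed
score = questeval.corpus_questeval(hypothesis=[prediction], sources=[source])
print(score)
Sample code for basic usage of QuestEval (following its README).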
6. NER (Named Entity Recognition)
NER is a natural language processing (NLP) task that involves identifying and classifying named entities within a body of text. Named entities refer to specific entities that are mentioned by name, such as persons, organizations, locations, dates, numerical expressions, and more.
- InterpretEval: Instead of reporting a single holistic score, InterpretEval breaks performance down by attributes; for example, it defines 8 attributes for the NER task and 7 for the Chinese Word Segmentation (CWS) task. The test set is bucketed along these attributes by dividing it into subsets of test entities (for span- and sentence-level attributes) or test tokens (for token-level attributes), and the performance of each bucket is then measured with standard statistical measures (a small illustrative sketch of the bucketing idea follows).
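To make the bucketing idea concrete, here is a small self-contained sketch (not the InterpretEval codebase itself) that groups gold entity spans by a single attribute, span length, and reports per-bucket recall on toy data.
from collections import defaultdict
# Toy gold and predicted entity spans per sentence: (start, end, label)
gold = [[(0, 2, "PER"), (5, 9, "ORG")], [(3, 4, "LOC")]]
pred = [[(0, 2, "PER")], [(3, 4, "LOC")]]
buckets = defaultdict(lambda: [0, 0])  # span length -> [matched, total]
for gold_sent, pred_sent in zip(gold, pred):
    predicted = set(pred_sent)
    for start, end, label in gold_sent:
        length = end - start  # bucketing attribute: entity span length
        buckets[length][1] += 1
        if (start, end, label) in predicted:
            buckets[length][0] += 1
for length, (matched, total) in sorted(buckets.items()):
    print(f"span length {length}: recall = {matched / total:.2f}")
Illustrative sketch of attribute bucketing for NER evaluation (toy data).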
DeepEval: Open Source LLM Evaluation Framework
DeepEval is a simple-to-use, open-source LLM evaluation framework. It incorporates the latest research to evaluate LLM outputs on many of the metrics discussed above and more, using LLMs and other NLP models that run locally on your machine.
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
threshold=0.5,
model="gpt-4",
assessment_questions=[
"Is the coverage score based on a percentage of 'yes' answers?",
"Does the score ensure the summary's accuracy with the source?",
"Does a higher score mean a more comprehensive summary?"
]
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
# or evaluate test cases in bulk
evaluate([test_case], [metric])
Sample SummarizationMetric from DeepEval.
Conclusion
In this article, we've explored various evaluation metrics and supporting frameworks within NLP, examining their practical relevance and implications for each text application task. Recognizing the critical role of evaluation in shaping language model development, it is essential to continuously refine methodologies and embrace emerging paradigms. Knowing the right evaluation metric for a given type of application, and the frameworks that can support it at scale, is paramount to successfully developing large-scale NLP systems.