Evaluation
Metrics

Creating an evaluation harness largely consists of collecting a ground truth dataset and defining a set of metrics to evaluate performance against it. Picking the right metrics can be tricky for generative applications, as the output is often subjective and can vary greatly.
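
As a minimal sketch, a harness can be as small as a list of input/expected pairs plus a metric function. The generate callable below is a hypothetical stand-in for whatever produces your model's output; it is not part of any library used on this page.

ground_truth = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "The capital of France is", "expected": "Paris"},
]


def exact_match(output: str, expected: str) -> bool:
    # Strict comparison: the output must equal the expected answer exactly
    return output.strip() == expected


def evaluate(generate) -> float:
    # `generate` is a hypothetical callable that maps an input string to a model output
    results = [exact_match(generate(case["input"]), case["expected"]) for case in ground_truth]
    return sum(results) / len(results)  # accuracy between 0 and 1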

Comparisons

Strict comparison

The simplest way to compare two strings is to check if they are equal. This is often referred to as a strict comparison and is useful for tasks where the output is expected to be identical to the ground truth. However, this approach can be too strict for generative tasks, as the model may produce slightly different but still correct outputs like '2 + 2 = 4' and '2 + 2 = four'.

import pytest
from log10.load import OpenAI
 
 
def test_addition():
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "system", "content": "2 + 2"}]
    )
 
    assert "4" == response.choices[0].message.content

Now, this will often fail, as models tend to explain the answer in different ways.

Fuzzy comparisons

To address the issue of strict comparisons, we can use fuzzy matching algorithms that allow for some flexibility in the comparison. A substring check (Python's in operator, or includes in JavaScript) is the simplest form of fuzzy matching: it only requires the expected value to appear somewhere in the output.

import pytest
from log10.load import OpenAI
 
 
def test_addition():
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "system", "content": "2 + 2"}]
    )
 
    assert "4" in response.choices[0].message.content

This should pass as long as the output contains the number 4. Other fuzzy matching techniques range from simple normalization, such as lowercasing and removing punctuation, to more advanced measures like Levenshtein distance and Jaccard similarity.
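
As a sketch of what that can look like with only the standard library, the helpers below normalize the strings and then compare them with difflib's sequence ratio and a token-level Jaccard similarity; the 0.8 threshold in the example assertion is an arbitrary choice.

import string
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    # Lowercase and strip punctuation before comparing
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


def edit_similarity(a: str, b: str) -> float:
    # Ratio of matching subsequences, closely related to edit distance
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def jaccard_similarity(a: str, b: str) -> float:
    # Size of the word-set intersection divided by the size of the union
    set_a, set_b = set(normalize(a).split()), set(normalize(b).split())
    return len(set_a & set_b) / len(set_a | set_b)


assert edit_similarity("The answer is 4.", "the answer is 4") > 0.8
assert jaccard_similarity("math is fun", "Math is FUN!") == 1.0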

BLEU

BLEU is a popular metric for evaluating machine translation. It compares a candidate sentence to one or more reference sentences and returns a score between 0 and 1: the higher the score, the closer the candidate is to the references.

from log10.load import OpenAI
 
from nltk.translate.bleu_score import sentence_bleu
 
 
def test_bleu():
    reference = "math is fun".split()
 
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "system", "content": "Tell me that math is fun without any explanation"}]
    )
 
    candidate = response.choices[0].message.content.split()
 
    bleu = sentence_bleu([reference], candidate)
 
    print(bleu)
    print(candidate)
    print(reference)
    assert bleu > 0.1 # This often fails, as the model tends to explain the answer in different ways.

ROUGE

Alternatively, ROUGE is a set of metrics for evaluating automatic summarization and machine translation. Unlike BLEU, ROUGE is not included in NLTK and typically requires additional preprocessing steps such as tokenization and stemming. The snippet below is a sketch using the rouge-score package, which handles both.

# NLTK has no ROUGE implementation; this sketch assumes the rouge-score package is installed
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(" ".join(reference), " ".join(candidate))

Read more in the nltk.translate.bleu_score documentation and the rouge-score package documentation.

Tools

Not to be confused with tool calling, external tools can be used to evaluate the output of an LLM. In use cases where LLMs generate code that can be validated with tools such as compilers and linters, you can run those tools against the output. Within Python, you can use the os.system function to run shell commands and check the return code, or rely on libraries for more advanced validation.

import os

import pytest
from log10.load import OpenAI
 
def test_sql():
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "system", "content": "Generate SQL that creates a table with a name and an age column, and inserts a row with the name 'Alice' and age 30"}]
    )
 
    with open("output.sql", "w") as f:
        f.write(response.choices[0].message.content)
 
    # Feed the generated SQL to sqlite3; -bail makes it exit with a non-zero code on the first error
    assert os.system("sqlite3 -bail test.db < output.sql") == 0
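
Shelling out assumes sqlite3 is on the PATH and that the reply is plain SQL with no surrounding prose or code fences. As an alternative sketch, Python's built-in sqlite3 module can validate the same script in-process, since invalid SQL raises an exception; the is_valid_sqlite helper below is illustrative, not part of any library.

import sqlite3


def is_valid_sqlite(sql: str) -> bool:
    # Run the script against a throwaway in-memory database; invalid SQL raises sqlite3.Error
    try:
        with sqlite3.connect(":memory:") as conn:
            conn.executescript(sql)
        return True
    except sqlite3.Error:
        return False

The assertion in the test above then becomes assert is_valid_sqlite(response.choices[0].message.content).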

LLM as a judge

In some cases, the LLM can be used as a judge to evaluate the output of another LLM. This can be useful when the ground truth is not available or when the output is subjective.

Here is an example of using an LLM to evaluate a story generated by another LLM. This would otherwise be hard to evaluate with traditional metrics.

import pytest
from log10.load import OpenAI
 
 
def test_summary():
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "generate a story about a cat, but don't include the word cat",
            }
        ],
    )
 
    assert " cat " not in response.choices[0].message.content.lower()
 
    # Call the LLM and ask what the story is about
 
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": response.choices[0].message.content},
            {"role": "system", "content": "Is this story about a cat? Y/N?"},
        ],
    )
 
    assert response.choices[0].message.content.lower().startswith("y")
 

Best practices for LLM as a judge

It can be deceptively easy to use an LLM as a judge, but there are some best practices to keep in mind:

  • LLMs have biases that can impact the evaluation: positional bias (e.g., the first option or criterion described in a rubric is more likely to be selected), self-preference (the LLM favors its own output), and more.
  • LLMs are not good at scoring on numeric scales; on a range from 1 to 5 they tend to cluster around the middle. In practice, YES / NO questions are often more precise (see the sketch after this list).
  • Using an LLM as a judge with few-shot samples can get expensive, since cost is proportional to the number of tokens processed; the output being evaluated is often only a fraction of the tokens compared with the rubric and samples.
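
As a sketch of one mitigation, the pairwise judge below asks a closed A/B question twice with the candidate order swapped and only accepts the verdict when both calls agree, which reduces the impact of positional bias. The prompt wording and the gpt-4o-mini model are illustrative assumptions, not a prescribed setup.

from log10.load import OpenAI

client = OpenAI()


def judge_once(story_a: str, story_b: str) -> str:
    # Ask for a single-letter verdict; short, closed questions are easier to score reliably
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Story A:\n{story_a}\n\nStory B:\n{story_b}\n\n"
                "Which story is more engaging? Answer with exactly one letter: A or B.",
            }
        ],
    )
    return response.choices[0].message.content.strip().upper()[:1]


def judge_pairwise(story_a: str, story_b: str) -> str:
    # Judge twice with the order swapped to counter positional bias
    first = judge_once(story_a, story_b)
    swapped = judge_once(story_b, story_a)
    # Map the swapped verdict back to the original labels
    swapped = {"A": "B", "B": "A"}.get(swapped, swapped)
    return first if first == swapped else "tie"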

To mitigate these issues, take a look at AutoFeedback for more information about how to use human feedback to bootstrap your evaluation process.