Evaluation

Evaluation is easily one of the most important practices when building production-grade AI systems. Without it, you can't be sure that your system is working as intended as you grow from a prototype to a production system. When new models become available, new features are implemented, and new customers are accommodated, you need to be able to quickly and confidently evaluate the impact of these changes.

At log10, we built one of the first declarative evaluation systems, LLM Eval, which let users define their evaluation in configuration files, run prompts across models and datasets, and generate reports with a single command.

While this approach was powerful, it was limited in its flexibility when it came to testing advanced use cases with multi-step, agentic systems, tool calling, and context retrieval. Furthermore, our customers requested ways to bootstrap their evaluations with captured logs and feedback from human evaluators, which calls for a more flexible approach to evaluation.

In this new version, we are introducing a more flexible, code-based approach to evaluation which integrates seamlessly with log10's logging capabilities.

Installation

Follow this guide to get set up with log10 in your local environment. After you have verified that log10 can capture logs from your application, you can install the evaluation package.

$ mkdir eval_test && cd eval_test
$ virtualenv venv
$ source venv/bin/activate
$ pip install log10-io

NOTE We recommend installing the package in a virtual environment, as otherwise unrelated pytest runs may be captured and sent to log10.

NOTE If you are not seeing runs or logs, it may be because the pytest being used is the system-wide one rather than the one in your virtual environment. You can check which one is being used with which pytest.
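
For example, with the virtual environment activated, which pytest should resolve to the copy inside venv (the exact path below is illustrative):

$ which pytest
/path/to/eval_test/venv/bin/pytest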

Getting started

Start by creating your first test file, test_example.py. The assertion is intentionally wrong, so you can see what a failing test looks like in the report.

import pytest
 
def test_example():
    assert 0 == 1

Run the test with the following command:

$ pytest test_example.py --log10

You should see something like this; note that log10-io is listed as one of the plugins:

========================================= test session starts ==========================================
platform darwin -- Python 3.12.2, pytest-8.3.3, pluggy-1.5.0
rootdir: test_log10_eval
plugins: metadata-3.1.1, anyio-4.6.0, log10-io-0.14.0
collected 1 item
 
test_example.py F                                                                                [100%]
 
=============================================== FAILURES ===============================================
_____________________________________________ test_example _____________________________________________
 
    def test_example():
>       assert 0 == 1
E       assert 0 == 1
 
test_example.py:4: AssertionError
========================================== Log10 Eval Report ===========================================
Log10 Eval is enabled.
Test run: tests-06aedd36-69af-45bb-918a-b1cab4e79298
Log10 Evaluation URL: log10.io/app/test-org/evaluations?id=06aedd36-69af-45bb-918a-b1cab4e79298
report saved to: test_log10_eval/.pytest_log10_eval_reports/tests-06aedd36-69af-45bb-918a-b1cab4e79298.report.json
Report successfully uploaded to Log10
======================================= short test summary info ========================================
FAILED test_example.py::test_example - assert 0 == 1
========================================== 1 failed in 27.13s ==========================================
 

And if you follow the link above, you should see the test run in the log10 dashboard.

Log10 dashboard

Now, let's create a second test file, test_another_example.py, that makes an LLM call, and see both the test and the call in the log10 dashboard.

import pytest
from log10.load import OpenAI
 
 
def test_example():
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "What is the capital of France?"}],
    )
 
    # Paris should be in the response, but it may have surrounding text, end with a period, etc.
    assert "paris" in response.choices[0].message.content.lower()

Run the new test:

$ pytest test_another_example.py --log10
========================================= test session starts ==========================================
platform darwin -- Python 3.12.2, pytest-8.3.3, pluggy-1.5.0
rootdir: test_log10_eval
plugins: metadata-3.1.1, anyio-4.6.0, log10-io-0.14.0
collected 1 item
 
test_another_example.py .                                                                        [100%]
 
========================================== Log10 Eval Report ===========================================
Log10 Eval is enabled.
Test run: tests-c7dab67c-21a9-4aa4-8ae4-f8284c7c4668
Log10 Evaluation URL: log10.io/app/test-org/evaluations?id=c7dab67c-21a9-4aa4-8ae4-f8284c7c4668
report saved to: test_log10_eval/.pytest_log10_eval_reports/tests-c7dab67c-21a9-4aa4-8ae4-f8284c7c4668.report.json
Report successfully uploaded to Log10
========================================== 1 passed in 3.14s ===========================================

You should see the test and the call in the log10 dashboard.

Log10 dashboard

Next steps

Now that you have a basic test running, you can start to explore different ways to test your application.
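
For example, one natural next step is to run the same check over several inputs with pytest's built-in parametrization, so that each case becomes its own test (and its own logged call). The sketch below reuses the log10-wrapped OpenAI client from this guide; the questions, expected substrings, and file name test_capitals.py are illustrative.

import pytest
from log10.load import OpenAI

# Illustrative question/expected-substring pairs; replace with cases from your own application.
CASES = [
    ("What is the capital of France?", "paris"),
    ("What is the capital of Japan?", "tokyo"),
]


@pytest.mark.parametrize("question,expected", CASES)
def test_capitals(question, expected):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )

    # The model may add surrounding text or punctuation, so check for a substring.
    assert expected in response.choices[0].message.content.lower()

Run it with pytest test_capitals.py --log10 as before; each parametrized case runs as a separate pytest test, and its LLM call is captured in the same way.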