Installation
llmeval is a tool that helps establish best practices for organizing prompt templates and running LLM application-level evaluations.
Used together with log10, your team can make safe changes and continuously improve the reliability and quality of your LLM application using the Log10 GitHub application.
Running llmeval locally
To install llmeval and get your repository set up for evaluation, you can execute the following:
```bash
$ cd yourrepo
$ pip install llmeval
$ # Initialize the repository with the expected prompt configuration structure
$ llmeval init
...
$ # Run evaluation
$ llmeval run
...
$ # Check results in markdown reports
$ llmeval report
```
This will initialize a prompts directory with a few examples to get started:
- llmeval.yaml
- capitals.yaml
- code.yaml
- grading.yaml
- math.yaml
- summary.yaml
- tests/
  - capitals.yaml
  - code.yaml
  - grading.yaml
  - math.yaml
  - summary.yaml
To demonstrate the prompt and test files, let us look at the math example in yourrepo/prompts/math.yaml:
```yaml
defaults:
  - tests/math

name: math
model: gpt-4-16k
temperature: 0.0
variables:
  - name: a
  - name: b
messages:
  - role: "human"
    content: "What is {a} + {b}? Only return the answer without any explanation"
```
Note that the message template can contain any number of messages from different roles (human, system, or ai).
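For example, a hypothetical variation of the math prompt (not one of the generated files) could add a system message ahead of the human turn:

```yaml
# Hypothetical variation of the math prompt with a system and a human message.
messages:
  - role: "system"
    content: "You are a careful arithmetic assistant. Answer with the number only."
  - role: "human"
    content: "What is {a} + {b}?"
```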
The corresponding test case looks like this (from yourrepo/prompts/tests/math.yaml):
```yaml
metrics:
  - name: exact_match
    code: |
      metric = actual == expected
      result = metric
references:
  - input:
      a: 4
      b: 4
    expected: "8"
```
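A test file is not limited to a single reference; a hypothetical extension of the math test with a second case could look like this, with each reference evaluated against the same metrics:

```yaml
references:
  - input:
      a: 4
      b: 4
    expected: "8"
  # Hypothetical second case, checked with the same exact_match metric
  - input:
      a: 10
      b: 32
    expected: "42"
```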
The metric code is evaluated by a Python interpreter, so it can be anything from a simple comparison (strict, fuzzy, includes, etc.) to model-based evaluation such as correctness scores or custom quality grading (from yourrepo/prompts/capitals.yaml):
```yaml
metrics:
  - name: correctness
    code: |
      from langchain.evaluation import load_evaluator

      evaluator = load_evaluator("labeled_criteria", criteria="correctness")
      eval_result = evaluator.evaluate_strings(
          input=prompt,
          prediction=actual,
          reference=expected,
      )
      metric = eval_result["score"]
      result = metric > 0.5
references:
  - input:
      country: Denmark
    expected: Copenhagen
  - input:
      country: Germany
    expected: Berlin
```
In this case, langchain's model-based evaluator is used to determine the degree of correctness against the expected reference value. But as you can probably tell, this can be customized quite a bit to fit your criteria for quality, or to call into tools or libraries that you may already have.
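As a simpler, non-model-based sketch, a hypothetical "includes" metric (not one of the generated files) could check whether the expected string appears anywhere in the model's answer:

```yaml
metrics:
  - name: includes
    code: |
      # Passes if the expected answer appears anywhere in the model output,
      # ignoring case; `actual` and `expected` are the same variables used above.
      metric = expected.lower() in actual.lower()
      result = metric
references:
  - input:
      country: Denmark
    expected: Copenhagen
```

Since `result` is ultimately a boolean computed by your own Python code, you can combine several checks, call external libraries, or threshold a numeric score as the correctness example does.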
Using llmeval with GitHub
Alpha release: evaluation support is still in alpha. To get set up with the log10 GitHub evaluation flow, please get in touch with us at [email protected].
Install application
Go to the log10 GitHub application and install it on the repository you are working on.
Select the repo owner, then select the repo.
Submit your first PR
Use git to create a PR in your repository (which has been initialized with llmeval init).
After a few minutes, you should get a report comment and a status check with the test results.
Next steps
To learn more about the configuration and reports, read more here.
Coming soon
- Comparison to baseline values
- Report history and overview