llmeval is a tool that helps establish best practices for organizing prompt templates and running application-level LLM evaluation. Used together with log10 and the Log10 GitHub application, it lets your team make changes safely, evolve your LLM application, and increase its reliability and quality.


Running llmeval locally

To install llmeval and get your repository set up for evaluation, you can execute the following:

$ cd yourrepo
$ pip install llmeval
$ # Initialize the repository with the expected prompt configuration structure
$ llmeval init
$ # Run evaluation
$ llmeval run
$ # Check results in markdown reports
$ llmeval report

This will initialize a prompts directory with a few examples to get started:

    • llmeval.yaml
    • prompts
      • capitals.yaml
      • code.yaml
      • grading.yaml
      • math.yaml
      • summary.yaml
      • tests
        • capitals.yaml
        • code.yaml
        • grading.yaml
        • math.yaml
        • summary.yaml
    To demonstrate the prompt and test files, let us look at the math example in yourrepo/prompts/math.yaml:

    defaults:
      - tests/math
    name: math
    model: gpt-4-16k
    temperature: 0.0
    variables:
      - name: a
      - name: b
    messages:
      - role: "human"
        content: "What is {a} + {b}? Only return the answer without any explanation"

    Note that the message template can contain any number of messages from different roles (human, system or ai).
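To make the placeholder syntax concrete, here is a minimal sketch of how the {a} and {b} placeholders in the math template could be filled from a test case's input. The helper name render_messages and the use of str.format are assumptions for illustration only; llmeval's own rendering may work differently.

```python
# Hypothetical sketch: substitute test-case inputs into a message template.
# `render_messages` is an illustrative helper, not part of llmeval.

messages = [
    {
        "role": "human",
        "content": "What is {a} + {b}? Only return the answer without any explanation",
    },
]
test_input = {"a": 4, "b": 4}


def render_messages(messages, variables):
    """Return a copy of `messages` with {placeholders} replaced by `variables`."""
    return [
        {"role": m["role"], "content": m["content"].format(**variables)}
        for m in messages
    ]


rendered = render_messages(messages, test_input)
print(rendered[0]["content"])
```

Because every message is rendered with the same variables, a template with several messages (for example a system message followed by a human message) would be handled the same way.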

    The corresponding test case looks like this (from yourrepo/prompts/tests/math.yaml):

    metrics:
      - name: exact_match
        code: |
          metric = actual == expected
          result = metric
    tests:
      - input:
          a: 4
          b: 4
        expected: "8"
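As a rough illustration of what evaluating the exact_match snippet involves, the sketch below runs the same two lines with Python's built-in exec, passing in actual and expected and reading back metric and result. The run_metric helper is hypothetical, and this is not a description of how llmeval executes metrics internally.

```python
# Hypothetical sketch: run a metric snippet with `actual` and `expected` in scope.
# `run_metric` is an illustrative helper, not llmeval's implementation.

metric_code = """
metric = actual == expected
result = metric
"""


def run_metric(code, actual, expected):
    """Execute `code` and return the (metric, result) values it assigns."""
    scope = {"actual": actual, "expected": expected}
    exec(code, scope)
    return scope["metric"], scope["result"]


metric, result = run_metric(metric_code, actual="8", expected="8")
print(metric, result)
```

Note that the comparison is between strings here: the expected value is quoted as "8" in the test file, and in Python "8" == 8 is False, so an unquoted expected value would not match a model reply of "8" under this metric.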

    The metric code is evaluated by a Python interpreter, so it can be anything from a simple comparison (strict, fuzzy, includes, etc.) to a model-based evaluation such as a correctness score or custom quality grading (from yourrepo/prompts/capitals.yaml):

    metrics:
      - name: correctness
        code: |
          from langchain.evaluation import load_evaluator
          evaluator = load_evaluator("labeled_criteria", criteria="correctness")
          # `prompt` is assumed to hold the rendered input text
          eval_result = evaluator.evaluate_strings(
            input=prompt,
            prediction=actual,
            reference=expected,
          )
          metric = eval_result["score"]
          result = metric > 0.5
    tests:
      - input:
          country: Denmark
        expected: Copenhagen
      - input:
          country: Germany
        expected: Berlin

    In this case, LangChain’s model-based evaluator determines the degree of correctness against the expected reference value. As you can probably tell, this can be customized quite a bit to fit your own criteria for quality, or to call into tools or libraries that you already have.
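For the simpler end of that range, a fuzzy comparison can be written with the standard library alone. The sketch below uses difflib.SequenceMatcher; the 0.8 threshold, the lowercasing and whitespace stripping, and the fuzzy_metric wrapper are all illustrative assumptions. In a metric's code block, only the two assignments to metric and result would be needed.

```python
# Hypothetical sketch of a fuzzy-match metric using only the standard library.
from difflib import SequenceMatcher


def fuzzy_metric(actual, expected, threshold=0.8):
    """Return (metric, result): a similarity ratio in [0, 1] and a pass/fail flag."""
    metric = SequenceMatcher(
        None, actual.strip().lower(), expected.strip().lower()
    ).ratio()
    result = metric > threshold
    return metric, result


# A reply with different casing and trailing punctuation still passes.
metric, result = fuzzy_metric("copenhagen.", "Copenhagen")
print(round(metric, 3), result)

# An unrelated answer falls below the threshold.
wrong_metric, wrong_result = fuzzy_metric("Berlin", "Copenhagen")
print(round(wrong_metric, 3), wrong_result)
```

The threshold trades strictness for tolerance: a higher value approaches exact matching, while a lower one accepts looser paraphrases of the expected answer.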

    Using llmeval with GitHub


    Alpha release: evaluation support is still in alpha. To get set up with the log10 GitHub evaluation flow, please get in touch with us at [email protected]

    Install application

    Go to the log10 GitHub application and install it on the repository you are working on.

    Select repo owner

    Select repo

    Submit your first PR

    Use git to create a PR in the repository (which has been initialized with llmeval init).


    After a few minutes, you should get a report comment and status with the test results.


    Next steps

    To learn more about the configuration and reports, read more here.

    Coming soon

    • Comparison to baseline values
    • Report history and overview