Your First Evaluation

Run your first LLM evaluation in 5 minutes and understand the results.

Step 1: Create Test Cases

Create a file called my-tests.json with some simple test cases:

[
  {
    "eval_id": "math_simple",
    "turns": [
      {
        "role": "user",
        "content": "What is 2+2?"
      },
      {
        "role": "assistant",
        "content": "4",
        "expected": true
      }
    ]
  },
  {
    "eval_id": "capital_france",
    "turns": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      },
      {
        "role": "assistant",
        "content": "Paris",
        "expected": true
      }
    ]
  },
  {
    "eval_id": "greeting",
    "turns": [
      {
        "role": "user",
        "content": "Hello!"
      },
      {
        "role": "assistant",
        "content": "Hello! How can I help you today?",
        "expected": true
      }
    ]
  }
]
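
Before running anything, it can help to sanity-check the file. The sketch below relies only on the structure visible in the example above (a JSON array of objects, each with an eval_id and a turns list whose entries carry role and content). It is not judge-llm's own validation: the function name find_problems, the set of allowed roles, and the rule that eval_id values must be unique are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Roles that appear in this guide; assumed (not confirmed) to be the full set.
ALLOWED_ROLES = {"system", "user", "assistant"}


def find_problems(cases):
    """Return a list of human-readable problems found in the test cases."""
    if not isinstance(cases, list):
        return ["top level must be a JSON array of test cases"]

    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case #{index}: must be a JSON object")
            continue

        eval_id = case.get("eval_id")
        label = eval_id if isinstance(eval_id, str) and eval_id else f"case #{index}"
        if not isinstance(eval_id, str) or not eval_id:
            problems.append(f"{label}: missing eval_id")
        elif eval_id in seen_ids:
            # Uniqueness is an assumption; it keeps report lines unambiguous.
            problems.append(f"{label}: duplicate eval_id")
        else:
            seen_ids.add(eval_id)

        turns = case.get("turns")
        if not isinstance(turns, list) or not turns:
            problems.append(f"{label}: turns must be a non-empty list")
            continue

        for turn_no, turn in enumerate(turns):
            if not isinstance(turn, dict):
                problems.append(f"{label}: turn {turn_no} must be a JSON object")
                continue
            if turn.get("role") not in ALLOWED_ROLES:
                problems.append(f"{label}: turn {turn_no} has an unknown role")
            if not isinstance(turn.get("content"), str):
                problems.append(f"{label}: turn {turn_no} needs string content")
    return problems


if __name__ == "__main__":
    path = Path("my-tests.json")
    if path.exists():
        found = find_problems(json.loads(path.read_text(encoding="utf-8")))
        print("\n".join(found) if found else "No problems found")
```

Running the script in the same directory as my-tests.json prints one line per problem, or "No problems found" if the file matches the assumed structure.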

Step 2: Create Configuration

Create a file called my-config.yaml:

dataset:
  loader: local_file
  paths:
    - ./my-tests.json

providers:
  - type: gemini
    agent_id: my_first_agent
    model: gemini-2.0-flash-exp
    temperature: 0.0

evaluators:
  - type: response_evaluator
    llm_provider: gemini

reporters:
  - type: console

Step 3: Run Evaluation

judge-llm run --config my-config.yaml

Understanding the Output

You'll see output like this:

Starting evaluation...

Evaluation Progress:
  math_simple: ✓ PASSED (cost: $0.0012, time: 1.2s)
  capital_france: ✓ PASSED (cost: $0.0015, time: 1.5s)
  greeting: ✓ PASSED (cost: $0.0010, time: 1.0s)

Summary:
  Total Tests: 3
  Passed: 3
  Failed: 0
  Success Rate: 100.0%
  Total Cost: $0.0037
  Total Time: 3.7s

What Each Part Means

Evaluation Progress:

math_simple: ✓ PASSED - Test ID and pass/fail status
cost: $0.0012 - API cost for this test
time: 1.2s - Execution time

Summary:

Total Tests: 3 - Number of test cases run
Passed: 3 - Number of tests that passed
Failed: 0 - Number of tests that failed
Success Rate: 100.0% - Percentage of tests passed
Total Cost: $0.0037 - Combined API cost
Total Time: 3.7s - Total execution time
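
To see how the summary relates to the per-test lines, the sketch below recomputes it from the three sample results shown earlier. It assumes the totals are plain sums of the per-test values, which matches the sample numbers but is not confirmed as the tool's behavior (for example, tests run in parallel could report a different total time). The summarize function is hypothetical and not part of judge-llm.

```python
from decimal import Decimal

# (eval_id, passed, cost in USD, time in seconds), copied from the sample output.
results = [
    ("math_simple", True, Decimal("0.0012"), Decimal("1.2")),
    ("capital_france", True, Decimal("0.0015"), Decimal("1.5")),
    ("greeting", True, Decimal("0.0010"), Decimal("1.0")),
]


def summarize(results):
    """Aggregate per-test results into the fields shown in the Summary block."""
    total = len(results)
    passed = sum(1 for _, ok, _, _ in results if ok)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        # Guard against division by zero when no tests were run.
        "success_rate": 100.0 * passed / total if total else 0.0,
        # Decimal avoids binary floating-point rounding noise in money sums.
        "total_cost": sum((cost for _, _, cost, _ in results), Decimal("0")),
        "total_time": sum((secs for _, _, _, secs in results), Decimal("0")),
    }


summary = summarize(results)
print(f"Total Tests: {summary['total']}")
print(f"Passed: {summary['passed']}")
print(f"Failed: {summary['failed']}")
print(f"Success Rate: {summary['success_rate']:.1f}%")
print(f"Total Cost: ${summary['total_cost']}")
print(f"Total Time: {summary['total_time']}s")
```

With the sample data, the printed cost and time totals are $0.0037 and 3.7s, the same figures as in the Summary block.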

Step 4: Try Different Outputs

Save to JSON

reporters:
  - type: console
  - type: json
    output_path: ./results.json

Run again:

judge-llm run --config my-config.yaml

Check the JSON output:

python -m json.tool results.json

Generate HTML Report

reporters:
  - type: console
  - type: html
    output_path: ./report.html

Run and open the report:

judge-llm run --config my-config.yaml
open report.html  # On Mac
# Or: xdg-open report.html  # On Linux
# Or: start report.html  # On Windows
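
If you would rather not remember a different command per operating system, Python's standard webbrowser module can open the report on all three. This is a small convenience sketch, not a judge-llm feature; the report_uri helper is a made-up name, and the path assumes the output_path from the config above.

```python
import webbrowser
from pathlib import Path


def report_uri(path="report.html"):
    """Turn a local report path into a file:// URI that a browser can open."""
    return Path(path).resolve().as_uri()


if __name__ == "__main__":
    # Only try to open the report if the evaluation actually produced it.
    if Path("report.html").exists():
        webbrowser.open(report_uri("report.html"))
    else:
        print("report.html not found; run the evaluation first")
```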

Step 5: Add More Evaluators

Let's add cost and latency checks:

dataset:
  loader: local_file
  paths:
    - ./my-tests.json

providers:
  - type: gemini
    agent_id: my_first_agent

evaluators:
  - type: response_evaluator
    llm_provider: gemini
  - type: cost_evaluator
    max_cost: 0.01  # Fail if cost > $0.01
  - type: latency_evaluator
    max_latency: 5.0  # Fail if latency > 5 seconds

reporters:
  - type: console

Run again:

judge-llm run --config my-config.yaml

Now you'll see multiple evaluator results per test:

Evaluation Progress:
  math_simple: ✓ PASSED (cost: $0.0012, time: 1.2s)
    ✓ response_evaluator: Response is correct
    ✓ cost_evaluator: Cost within limit ($0.0012 < $0.01)
    ✓ latency_evaluator: Latency within limit (1.2s < 5.0s)

Common Patterns

Test Multiple Models

Compare different models:

providers:
  - type: gemini
    agent_id: gemini_flash
    model: gemini-2.0-flash-exp
  
  - type: openai
    agent_id: gpt4
    model: gpt-4

evaluators:
  - type: response_evaluator

reporters:
  - type: html
    output_path: ./comparison.html

Multi-Turn Conversations

Test conversations with multiple exchanges:

{
  "eval_id": "conversation_test",
  "turns": [
    {
      "role": "user",
      "content": "My name is Alice."
    },
    {
      "role": "assistant",
      "content": "Nice to meet you, Alice!",
      "expected": true
    },
    {
      "role": "user",
      "content": "What's my name?"
    },
    {
      "role": "assistant",
      "content": "Your name is Alice.",
      "expected": true
    }
  ]
}

System Prompts

Include system instructions:

{
  "eval_id": "with_system_prompt",
  "turns": [
    {
      "role": "system",
      "content": "You are a helpful math tutor. Always show your work."
    },
    {
      "role": "user",
      "content": "What is 15 * 3?"
    },
    {
      "role": "assistant",
      "content": "15 * 3 = 45\nBreaking it down: 15 + 15 + 15 = 45",
      "expected": true
    }
  ]
}
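
Writing long conversations by hand gets tedious, so test cases like the two above can also be generated. The sketch below builds the same structure from a list of (user, assistant) pairs with an optional system prompt. The build_case helper is hypothetical and not part of judge-llm, and it assumes every assistant turn should carry "expected": true, as in the examples in this guide.

```python
import json


def build_case(eval_id, exchanges, system_prompt=None):
    """Build one test case from (user_text, assistant_text) pairs."""
    turns = []
    if system_prompt is not None:
        # The system turn comes first, as in the System Prompts example.
        turns.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in exchanges:
        turns.append({"role": "user", "content": user_text})
        turns.append(
            {"role": "assistant", "content": assistant_text, "expected": True}
        )
    return {"eval_id": eval_id, "turns": turns}


case = build_case(
    "conversation_test",
    [
        ("My name is Alice.", "Nice to meet you, Alice!"),
        ("What's my name?", "Your name is Alice."),
    ],
)

# A test file is a JSON array of cases, so wrap the case in a list.
print(json.dumps([case], indent=2))
```

Redirecting the output to a file gives a test file with the same shape as my-tests.json.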

What You've Learned

✅ Created test cases in JSON format
✅ Configured providers and evaluators
✅ Ran your first evaluation
✅ Understood the output
✅ Generated different report formats
✅ Added multiple evaluators

Next Steps

Now you're ready to:

Explore Examples - See more advanced examples
Learn Configuration - Master all config options
Create Custom Evaluators - Build domain-specific evaluators
Use Python API - Run evaluations programmatically

Troubleshooting

Tests Not Passing

If your tests are failing, check:

Is the expected response correct? Make sure "content" holds the answer you expect:

{
  "role": "assistant",
  "content": "4",
  "expected": true
}

Are you using the right evaluator?
- Response evaluator checks correctness
- Cost evaluator checks budget
- Latency evaluator checks speed

Check the reason in output:

test_001: ✗ FAILED
  ✗ response_evaluator: Response is incorrect

High Costs

If costs are too high:

Add a cost evaluator:

evaluators:
  - type: cost_evaluator
    max_cost: 0.01

Use cheaper models:

providers:
  - type: gemini
    model: gemini-2.0-flash-exp  # Cheaper than gemini-pro

Slow Execution

If evaluations are slow:

Add a latency evaluator to find slow tests:

evaluators:
  - type: latency_evaluator
    max_latency: 3.0

Use faster models
Reduce test complexity

Quick Reference

File Structure

project/
├── my-tests.json      # Test cases
├── my-config.yaml     # Configuration
├── .env               # API keys
└── results/           # Output reports

Basic Commands

# Run evaluation
judge-llm run --config my-config.yaml

# Validate config
judge-llm validate --config my-config.yaml

# List available components
judge-llm list providers
judge-llm list evaluators
judge-llm list reporters

Basic Config Template

dataset:
  loader: local_file
  paths: [./tests.json]

providers:
  - type: gemini
    agent_id: my_agent

evaluators:
  - type: response_evaluator

reporters:
  - type: console
