
Configuration Guide

Comprehensive guide to configuring Judge LLM evaluations using YAML files.

Configuration File Structure

Judge LLM uses YAML configuration files to define evaluations. The basic structure includes four main sections:

dataset:
  # Data loading configuration

providers:
  # LLM provider(s) to evaluate

evaluators:
  # Evaluation criteria

reporters:
  # Output format(s)

Dataset Configuration

The dataset section specifies where to load test cases from.

Local File Loader

Load test cases from JSON files on your local filesystem.

dataset:
  loader: local_file
  paths:
    - ./tests.json
    - ./more_tests.json

Test File Format:

[
  {
    "eval_id": "test_001",
    "turns": [
      {
        "role": "user",
        "content": "What is 2+2?"
      },
      {
        "role": "assistant",
        "content": "4",
        "expected": true
      }
    ]
  }
]
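The structural rules above can be checked mechanically. Here is a hedged sketch (the `validate_test_file` helper is hypothetical, not part of Judge LLM) that verifies the shape shown in the example: a top-level array of cases, each with an `eval_id` and a `turns` list of role/content entries.

```python
import json

# Hypothetical helper: checks the test-file shape shown above.
# Field names (eval_id, turns, role, content) follow the example;
# the framework's real loader may enforce additional rules.
def validate_test_file(path):
    with open(path) as f:
        cases = json.load(f)
    if not isinstance(cases, list):
        raise ValueError("top level must be a JSON array")
    for case in cases:
        if "eval_id" not in case:
            raise ValueError("each case needs an eval_id")
        if not isinstance(case.get("turns"), list):
            raise ValueError("each case needs a turns list")
        for turn in case["turns"]:
            if turn.get("role") not in ("user", "assistant"):
                raise ValueError("turn role must be user or assistant")
            if "content" not in turn:
                raise ValueError("each turn needs content")
    return len(cases)
```

Running a check like this before an evaluation is a cheap way to catch malformed files early, especially when test cases are hand-edited.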

Directory Loader

Load test cases from all matching files in a directory and its subdirectories. Files are discovered recursively, and results are grouped by directory structure in reports.

dataset:
  loader: directory
  paths:
    - ./data/eval_sets
  pattern: "*.json"  # default; also supports "*.yaml", "*.yml", or custom patterns like "eval*.json"

Example directory structure:

eval_sets/
├── basic/
│   ├── greetings.json
│   └── math.json
├── advanced/
│   ├── reasoning/
│   │   └── multi_step.json
│   └── coding.json
└── safety.json

Each loaded eval set tracks its source_path (relative to the root directory), which flows through to execution runs and reports. The console, HTML, database, and dashboard reporters all group results by directory structure.
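Conceptually, the discovery step behaves like a recursive glob grouped by parent directory. This sketch (the `discover` helper is illustrative, not the loader's actual code) shows how relative paths like those used for grouping can be derived:

```python
from collections import defaultdict
from pathlib import Path

# Illustrative sketch: find files matching the pattern recursively and
# group them by their directory relative to the root -- the same shape
# reporters use when grouping results. Not the loader's actual code.
def discover(root, pattern="*.json"):
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob(pattern)):
        rel = path.relative_to(root)
        groups[str(rel.parent)].append(rel.name)
    return dict(groups)
```

For the example tree above, files in `basic/` and `advanced/reasoning/` would land in separate groups, while `safety.json` would group under the root (`.`).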

BrowserBase Loader

Load test cases from BrowserBase sessions (for web-based testing).

dataset:
  loader: browserbase
  api_key: ${BROWSERBASE_API_KEY}
  project_id: ${BROWSERBASE_PROJECT_ID}

Provider Configuration

The providers section defines which LLM(s) to evaluate. You can configure single or multiple providers to test and compare different models.

Gemini Provider

Google's Gemini models via the official API.

providers:
  - type: gemini
    agent_id: gemini_agent
    model: gemini-2.0-flash-exp
    temperature: 0.0
    max_tokens: 8192
    top_p: 0.95
    top_k: 40
    api_key: ${GOOGLE_API_KEY}

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Provider type (gemini) | - | Yes |
| agent_id | Unique identifier for this agent | - | Yes |
| model | Model name | gemini-2.0-flash-exp | No |
| temperature | Sampling temperature (0-1) | 1.0 | No |
| max_tokens | Maximum response tokens | 8192 | No |
| top_p | Top-p sampling (0-1) | 0.95 | No |
| top_k | Top-k sampling | 40 | No |
| api_key | API key | ${GOOGLE_API_KEY} | No |

Supported Models:

  • gemini-2.0-flash-exp - Latest experimental flash model (recommended)
  • gemini-1.5-flash - Fast, cost-effective model
  • gemini-1.5-pro - High capability model
  • gemini-1.5-flash-8b - Ultra-fast, lightweight model

Note: Requires GOOGLE_API_KEY environment variable or explicit api_key parameter.

Mock Provider

Built-in test provider that returns expected responses without API calls. Perfect for testing and development.

providers:
  - type: mock
    agent_id: test_agent

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Provider type (mock) | - | Yes |
| agent_id | Unique identifier | - | Yes |

Features:

  • No API calls - returns conversation history from test cases
  • Instant execution - no network latency
  • Cost calculation - mock token counting for testing
  • No authentication required

Use Cases:

  • Development and testing
  • CI/CD without API costs
  • Validating test case structure
  • Framework development

Google ADK Provider

Google's Agent Development Kit for building AI agents with tool use.

providers:
  - type: google_adk
    agent_id: adk_agent
    agent_metadata:
      module_path: "my_agent.agent"
      agent_name: "root_agent"
      root_path: "."

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Provider type (google_adk) | - | Yes |
| agent_id | Unique identifier | - | Yes |
| agent_metadata.module_path | Python module path to agent | - | Yes |
| agent_metadata.agent_name | Agent variable name in module | root_agent | No |
| agent_metadata.root_path | Root directory for imports | . | No |

Features:

  • Async agent execution
  • Tool calling support
  • Multi-turn conversations
  • Session management
  • Thread-safe agent caching

See the examples/09-google-adk-agent directory in the repository for a complete setup guide.

OpenAI Provider

OpenAI's GPT models (requires custom implementation).

providers:
  - type: openai
    agent_id: openai_agent
    model: gpt-4
    temperature: 0.7
    max_tokens: 2048
    api_key: ${OPENAI_API_KEY}

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Provider type (openai) | - | Yes |
| agent_id | Unique identifier | - | Yes |
| model | Model name | gpt-4 | No |
| temperature | Sampling temperature (0-1) | 0.7 | No |
| max_tokens | Maximum response tokens | 2048 | No |
| api_key | API key | ${OPENAI_API_KEY} | No |

Note: Requires custom provider implementation. See Custom Providers below.

Anthropic Provider

Anthropic's Claude models (requires custom implementation).

providers:
  - type: anthropic
    agent_id: claude_agent
    model: claude-3-5-sonnet-20241022
    temperature: 0.0
    max_tokens: 4096
    api_key: ${ANTHROPIC_API_KEY}

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Provider type (anthropic) | - | Yes |
| agent_id | Unique identifier | - | Yes |
| model | Model name | claude-3-5-sonnet-20241022 | No |
| temperature | Sampling temperature (0-1) | 1.0 | No |
| max_tokens | Maximum response tokens | 4096 | No |
| api_key | API key | ${ANTHROPIC_API_KEY} | No |

Note: Requires custom provider implementation. See Custom Providers below.

Multiple Providers (A/B Testing)

Compare multiple models in a single evaluation run. Each provider evaluates the same test cases independently.

providers:
  - type: gemini
    agent_id: gemini_flash
    model: gemini-2.0-flash-exp
    temperature: 0.0

  - type: gemini
    agent_id: gemini_pro
    model: gemini-1.5-pro
    temperature: 0.0

  - type: mock
    agent_id: baseline

Benefits:

  • Compare response quality across models
  • Compare cost and latency
  • Identify best model for your use case
  • Regression testing when switching models

Output: Each provider generates separate execution runs in the report, allowing side-by-side comparison.

Custom Providers

Implement custom providers for any LLM API or service.

providers:
  - type: custom
    module_path: ./providers/my_provider.py
    class_name: MyCustomProvider
    agent_id: custom_agent
    # Custom config options passed to provider
    endpoint: https://api.example.com
    api_key: ${CUSTOM_API_KEY}
    timeout: 30

Implementation Example:

# providers/my_provider.py
from judge_llm.providers.base import BaseProvider
from judge_llm.core.models import EvalCase, ProviderResult, Invocation, Content, Part


class MyCustomProvider(BaseProvider):
    def __init__(self, agent_id, agent_config_path=None, agent_metadata=None, **provider_metadata):
        super().__init__(agent_id, agent_config_path, agent_metadata, **provider_metadata)

        # Access custom config
        self.endpoint = provider_metadata.get("endpoint")
        self.api_key = provider_metadata.get("api_key")
        self.timeout = provider_metadata.get("timeout", 30)

    def execute(self, eval_case: EvalCase) -> ProviderResult:
        """Execute evaluation case using your custom LLM."""
        try:
            # Call your LLM API
            response = self._call_llm_api(eval_case)

            # Build conversation history
            conversation_history = [
                Invocation(
                    invocation_id=inv.invocation_id,
                    user_content=inv.user_content,
                    final_response=Content(
                        parts=[Part(text=response["text"])],
                        role="model"
                    ),
                    intermediate_data=inv.intermediate_data,
                    creation_timestamp=inv.creation_timestamp
                )
                for inv in eval_case.conversation
            ]

            return ProviderResult(
                conversation_history=conversation_history,
                cost=response.get("cost", 0.0),
                token_usage=response.get("tokens", {}),
                metadata={
                    "provider": "custom",
                    "agent_id": self.agent_id,
                    "model": response.get("model")
                },
                success=True
            )

        except Exception as e:
            return ProviderResult(
                conversation_history=[],
                success=False,
                error=str(e)
            )

    def _call_llm_api(self, eval_case):
        """Call your custom LLM API."""
        # Your implementation here
        pass

    def cleanup(self):
        """Cleanup resources."""
        pass

Registration Options:

  1. Inline in config (shown above)
  2. Register globally in .judge_llm.defaults.yaml:
# .judge_llm.defaults.yaml
providers:
  - type: custom
    module_path: ./providers/my_provider.py
    class_name: MyCustomProvider
    register_as: my_provider  # ← Register globally

# Then use by name in test configs
providers:
  - type: my_provider  # ← Uses registered provider
    agent_id: test_agent

See Custom Component Registration for more details.

Agent Configuration

The agent section controls execution behavior and quality gates.

agent:
  num_runs: 1                        # Number of times to run each test case
  parallel_execution: false          # Run tests in parallel
  max_workers: 4                     # Max parallel workers (if parallel enabled)
  fail_on_threshold_violation: true  # Exit with error if evaluations fail
  log_level: INFO                    # Logging level (DEBUG|INFO|WARNING|ERROR)

Configuration Options

| Option | Description | Default | Type |
|---|---|---|---|
| num_runs | Number of times to execute each eval case | 1 | integer |
| parallel_execution | Enable parallel execution of test cases | false | boolean |
| max_workers | Maximum number of parallel worker threads | 4 | integer |
| fail_on_threshold_violation | Exit with error code when evaluator thresholds are violated | true | boolean |
| log_level | Logging verbosity level | INFO | string |

fail_on_threshold_violation

Purpose: Controls whether the evaluation process should fail (exit with error code 1) when any evaluator thresholds are violated.

Use Cases:

✅ When to enable (fail_on_threshold_violation: true):

  • CI/CD Pipelines: Fail builds when LLM quality drops below thresholds
  • Regression Testing: Prevent deployments if model performance degrades
  • Quality Gates: Enforce minimum quality standards before production
  • Pre-commit Hooks: Block commits that violate quality thresholds

# CI/CD configuration - fail on violations
agent:
  fail_on_threshold_violation: true  # Block deployments if quality drops

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85  # Minimum 85% similarity required
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05  # Maximum $0.05 per test

📊 When to disable (fail_on_threshold_violation: false):

  • Monitoring & Reporting: Track metrics over time without failing
  • Exploratory Testing: Test new models/prompts without strict requirements
  • Development: Iterate quickly without strict quality gates
  • Gradual Rollout: Collect data before enforcing thresholds

# Monitoring configuration - collect data without failing
agent:
  fail_on_threshold_violation: false  # Continue despite violations

reporters:
  - type: database
    db_path: ./metrics.db  # Track trends over time

Error Output Example:

When violations occur with fail_on_threshold_violation: true, you'll see:

================================================================================
THRESHOLD VIOLATION DETECTED
================================================================================

❌ 3/10 evaluation(s) failed to meet thresholds
Success rate: 70.0% (100% required)

Failed evaluation cases:
• test_001 (run 1) - Failed: response, cost
• test_003 (run 1) - Failed: latency
• test_007 (run 1) - Failed: trajectory

================================================================================
💡 TIP: Set 'fail_on_threshold_violation: false' in agent config to continue
despite threshold violations (useful for monitoring/testing)
================================================================================

Exit code: 1 (failure)

Telemetry Configuration

Enable OpenTelemetry tracing for deep observability into evaluation runs.

agent:
  telemetry:
    enabled: true                    # Enable tracing (default: false)
    exporter: phoenix                # "console", "otlp", or "phoenix"
    service_name: judge-llm          # Service/project name for traces
    endpoint: http://localhost:6006  # Exporter endpoint (optional)

| Option | Description | Default | Type |
|---|---|---|---|
| telemetry.enabled | Enable OpenTelemetry tracing | false | boolean |
| telemetry.exporter | Exporter type (console, otlp, phoenix) | console | string |
| telemetry.service_name | Service name for traces | judge-llm | string |
| telemetry.endpoint | Exporter endpoint URL | varies by exporter | string |

Can also be enabled via environment variable JUDGE_LLM_TELEMETRY=true or CLI flag --telemetry.

Requires optional dependencies: pip install judge-llm[telemetry] or pip install judge-llm[phoenix].

See the Telemetry Guide for full documentation including span attributes and backend setup.

Parallel Execution

Enable parallel execution to speed up large test suites:

agent:
  parallel_execution: true
  max_workers: 8  # Use 8 parallel threads
  num_runs: 3     # Each test runs 3 times in parallel

Performance Tips:

  • Set max_workers based on your CPU cores (typically 2x cores)
  • Parallel execution reduces wall-clock time but not total execution time
  • Monitor memory usage with large test suites
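The execution model can be pictured as a thread pool over (case, run) pairs. A minimal sketch under that assumption, where `run_case` is a stand-in for a provider call (the framework's actual scheduler may differ):

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative model of parallel_execution: every (case, run) pair is
# submitted to a bounded thread pool, mirroring max_workers / num_runs.
def run_all(cases, run_case, max_workers=4, num_runs=1):
    jobs = [(case, run) for case in cases for run in range(1, num_runs + 1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves job order in the returned results
        results = list(pool.map(lambda job: run_case(*job), jobs))
    return results
```

This also makes the wall-clock vs. total-time distinction concrete: with 10 cases, `num_runs: 3`, and `max_workers: 8`, all 30 jobs still run, but up to 8 at a time.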

Evaluator Configuration

The evaluators section defines how responses are evaluated.

Response Evaluator

Evaluates response correctness using an LLM judge.

evaluators:
  - type: response_evaluator
    llm_provider: gemini
    llm_model: gemini-2.0-flash-exp
    temperature: 0.0

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Evaluator type | - | Yes |
| llm_provider | LLM for judging | gemini | No |
| llm_model | Model for judging | Provider default | No |
| temperature | Temperature for judge | 0.0 | No |

Trajectory Evaluator

Evaluates the reasoning process and intermediate steps.

evaluators:
  - type: trajectory_evaluator
    llm_provider: gemini
    llm_model: gemini-2.0-flash-exp

Cost Evaluator

Ensures cost stays within threshold.

evaluators:
  - type: cost_evaluator
    max_cost: 0.01  # Fail if cost > $0.01

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Evaluator type | - | Yes |
| max_cost | Maximum cost threshold | - | Yes |

Latency Evaluator

Ensures response time stays within threshold.

evaluators:
  - type: latency_evaluator
    max_latency: 5.0  # Fail if latency > 5 seconds

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Evaluator type | - | Yes |
| max_latency | Maximum latency (seconds) | - | Yes |

Multiple Evaluators

Combine multiple evaluation criteria:

evaluators:
  - type: response_evaluator
    llm_provider: gemini

  - type: trajectory_evaluator
    llm_provider: gemini

  - type: cost_evaluator
    max_cost: 0.01

  - type: latency_evaluator
    max_latency: 3.0

All evaluators must pass for a test case to be considered successful.
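The pass/fail rule is a simple conjunction. A minimal sketch, modeling evaluator outcomes as a name-to-bool mapping (the real framework uses richer result objects):

```python
# Hedged model of the rule above: a test case succeeds only if every
# configured evaluator passes. Outcomes are modeled as {name: bool}.
def case_passed(evaluator_results):
    return all(evaluator_results.values())

def failed_evaluators(evaluator_results):
    # Names of the evaluators that violated their thresholds
    return [name for name, ok in evaluator_results.items() if not ok]
```

This is the same logic behind the "Failed: response, cost" lines in the threshold-violation output shown in the Agent Configuration section.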

Custom Evaluators

evaluators:
  - type: custom
    module_path: ./evaluators/safety.py
    class_name: SafetyEvaluator
    # Custom config options
    strict_mode: true

Reporter Configuration

The reporters section defines how results are output.

Console Reporter

Print results to terminal.

reporters:
  - type: console

No additional configuration required.

JSON Reporter

Export results as JSON.

reporters:
  - type: json
    output_path: ./results.json

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Reporter type | - | Yes |
| output_path | Path to JSON file | - | Yes |

HTML Reporter

Generate interactive HTML report.

reporters:
  - type: html
    output_path: ./report.html

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Reporter type | - | Yes |
| output_path | Path to HTML file | - | Yes |

Database Reporter

Store results in SQLite database.

reporters:
  - type: database
    db_path: ./results.db

Configuration Options:

| Option | Description | Default | Required |
|---|---|---|---|
| type | Reporter type | - | Yes |
| db_path | Path to SQLite database | - | Yes |

Multiple Reporters

Use multiple output formats simultaneously:

reporters:
  - type: console

  - type: json
    output_path: ./results.json

  - type: html
    output_path: ./report.html

  - type: database
    db_path: ./results.db

Custom Reporters

reporters:
  - type: custom
    module_path: ./reporters/slack.py
    class_name: SlackReporter
    webhook_url: ${SLACK_WEBHOOK_URL}
    channel: "#evals"

Environment Variables

Use environment variables for sensitive data and configuration that changes between environments.

Syntax

Reference environment variables using ${VAR_NAME} syntax:

providers:
  - type: gemini
    agent_id: ${AGENT_ID}
    api_key: ${GEMINI_API_KEY}
    model: ${MODEL:-gemini-2.0-flash-exp}  # Default value
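The `${VAR}` and `${VAR:-default}` forms behave like shell parameter expansion. This is an illustrative sketch of that substitution, not Judge LLM's actual implementation (which may handle edge cases such as unset variables differently):

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; group 1 is the name,
# group 2 the optional default.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")

def expand(value, env=None):
    env = os.environ if env is None else env
    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: an unset variable with no default is left as-is;
        # the real framework might instead raise an error here.
        return match.group(0)
    return _PATTERN.sub(repl, value)
```

With `GEMINI_API_KEY` set but `MODEL` unset, `${MODEL:-gemini-2.0-flash-exp}` resolves to the default while `${GEMINI_API_KEY}` resolves to the environment value.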

Loading from .env

Create a .env file in your project root:

# .env
AGENT_ID=my_agent
GEMINI_API_KEY=your_api_key
OPENAI_API_KEY=your_openai_key
MODEL=gemini-2.0-flash-exp

Judge LLM automatically loads .env files when running evaluations.

Environment-Specific Configuration

# dev.yaml
providers:
  - type: gemini
    agent_id: dev_agent
    model: ${DEV_MODEL}
    api_key: ${DEV_API_KEY}

# prod.yaml
providers:
  - type: gemini
    agent_id: prod_agent
    model: ${PROD_MODEL}
    api_key: ${PROD_API_KEY}

Default Configuration

Create a .judge_llm.defaults.yaml file to define reusable defaults.

Project Defaults

Place in project root: .judge_llm.defaults.yaml

# .judge_llm.defaults.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
    temperature: 0.0

evaluators:
  - type: response_evaluator
    llm_provider: gemini

  - type: cost_evaluator
    max_cost: 0.05

reporters:
  - type: console

  - type: json
    output_path: ./results/latest.json

Global Defaults

Place in home directory: ~/.judge_llm/defaults.yaml

# ~/.judge_llm/defaults.yaml
providers:
  - type: gemini
    api_key: ${GEMINI_API_KEY}
    temperature: 0.0

reporters:
  - type: console

Merging Behavior

Defaults are merged with your test config:

  1. Global defaults (~/.judge_llm/defaults.yaml)
  2. Project defaults (.judge_llm.defaults.yaml)
  3. Test config (test.yaml)

Test config values override defaults.
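The layering above amounts to a recursive dictionary merge applied in order. A sketch under that assumption (the framework's real merge semantics, especially for lists of providers, may behave differently):

```python
# Later layers override earlier ones key-by-key; nested dicts are
# merged recursively rather than replaced wholesale.
def deep_merge(base, override):
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def effective_config(global_defaults, project_defaults, test_config):
    # Precedence: global defaults < project defaults < test config
    return deep_merge(deep_merge(global_defaults, project_defaults), test_config)
```

For example, a `log_level` set in the test config wins over the same key in either defaults file, while untouched keys flow through from the defaults.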

Custom Component Registration

Register custom components in default config for reuse across multiple test configs.

Registering Providers

# .judge_llm.defaults.yaml
providers:
  - type: custom
    module_path: ./providers/my_provider.py
    class_name: MyProvider
    register_as: my_provider  # ← Register globally

Use by name in test configs:

# test.yaml
providers:
  - type: my_provider  # ← Use by name
    agent_id: test_agent

Registering Evaluators

# .judge_llm.defaults.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety

Use by name:

# test.yaml
evaluators:
  - type: safety
  - type: response_evaluator

Registering Reporters

# .judge_llm.defaults.yaml
reporters:
  - type: custom
    module_path: ./reporters/csv.py
    class_name: CSVReporter
    register_as: csv

Use by name:

# test.yaml
reporters:
  - type: csv
    output_path: ./results.csv

Complete Configuration Examples

Basic Single-Provider Test

dataset:
  loader: local_file
  paths:
    - ./tests.json

providers:
  - type: gemini
    agent_id: test_agent

evaluators:
  - type: response_evaluator

reporters:
  - type: console

Multi-Provider A/B Test

dataset:
  loader: local_file
  paths:
    - ./tests.json

providers:
  - type: gemini
    agent_id: gemini
    model: gemini-2.0-flash-exp

  - type: openai
    agent_id: openai
    model: gpt-4

evaluators:
  - type: response_evaluator
  - type: cost_evaluator
    max_cost: 0.01
  - type: latency_evaluator
    max_latency: 5.0

reporters:
  - type: console
  - type: html
    output_path: ./comparison.html
  - type: database
    db_path: ./ab_test.db

Production Configuration

dataset:
  loader: local_file
  paths:
    - ./tests/smoke.json
    - ./tests/regression.json

providers:
  - type: gemini
    agent_id: prod_agent
    model: ${PROD_MODEL}
    temperature: 0.0
    api_key: ${GEMINI_API_KEY}

evaluators:
  - type: response_evaluator
    llm_provider: gemini

  - type: trajectory_evaluator
    llm_provider: gemini

  - type: cost_evaluator
    max_cost: 0.05

  - type: latency_evaluator
    max_latency: 3.0

reporters:
  - type: console

  - type: json
    output_path: ./results/prod-${date}.json

  - type: html
    output_path: ./reports/prod-${date}.html

  - type: database
    db_path: ./results.db

Configuration Validation

Validate your configuration before running:

judge-llm validate --config test.yaml

Common validation errors:

| Error | Solution |
|---|---|
| Missing required field | Add the required field to your config |
| Invalid provider type | Check provider type name (gemini, openai, anthropic) |
| Invalid evaluator type | Check evaluator type name |
| Invalid reporter type | Check reporter type name |
| Invalid file path | Ensure paths are correct and accessible |
| Missing API key | Set environment variable or add to .env |

Best Practices

1. Use Environment Variables for Secrets

Bad:

providers:
  - type: gemini
    api_key: "AIzaSyA..."  # Hard-coded secret

Good:

providers:
  - type: gemini
    api_key: ${GEMINI_API_KEY}

2. Organize Configs by Environment

configs/
├── dev.yaml
├── staging.yaml
└── prod.yaml

3. Use Defaults for Common Settings

Put common settings in .judge_llm.defaults.yaml:

# .judge_llm.defaults.yaml
providers:
  - type: gemini
    temperature: 0.0
    model: gemini-2.0-flash-exp

evaluators:
  - type: cost_evaluator
    max_cost: 0.05

Keep test configs focused:

# test.yaml
dataset:
  loader: local_file
  paths: [./tests.json]

providers:
  - agent_id: my_test  # Other settings from defaults

4. Version Control Configuration

Include in git:

  • Configuration files (.yaml)
  • Default configuration (.judge_llm.defaults.yaml)

Exclude from git (.gitignore):

  • .env (contains secrets)
  • Results files (*.json, *.html, *.db)

5. Document Custom Configuration

Add comments to explain custom settings:

providers:
  - type: gemini
    agent_id: specialized_agent
    temperature: 0.8  # Higher temperature for creative tasks
    max_tokens: 4096  # Longer responses needed

Troubleshooting

Configuration Not Found

Error: Configuration file not found

Solution: Check file path and ensure it exists:

ls -la test.yaml

Environment Variable Not Set

Error: API key not found for provider: gemini

Solution: Set the environment variable:

export GEMINI_API_KEY=your_key
# Or add to .env file

Invalid YAML Syntax

Error: YAML parsing error

Solution: Validate YAML syntax:

# Use yamllint or online YAML validator
yamllint test.yaml

Custom Component Not Found

Error: Module not found: ./providers/my_provider.py

Solution: Check module path is correct relative to config file location.