Basic Usage
Get started with Judge LLM quickly: go from installation to your first evaluation in minutes.
Installation
Install Judge LLM using pip:
pip install judge-llm
Verify the installation:
judge-llm --version
Quick Start
1. Set Up API Keys
Create a .env file in your project root:
# .env
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
Or set environment variables:
export GEMINI_API_KEY=your_key
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
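Before running an evaluation, it can help to confirm that the keys are actually visible to the process. The following is a minimal sketch using only the Python standard library; `missing_keys` is a hypothetical helper, not part of Judge LLM, and it inspects the process environment only (it does not read a .env file).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ["GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys(REQUIRED_KEYS)
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

If you only configure one provider, you would presumably trim `REQUIRED_KEYS` to the matching variable.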
2. Create Test Cases
Create a tests.json file with your test cases:
[
  {
    "eval_id": "test_001",
    "turns": [
      {
        "role": "user",
        "content": "What is 2+2?"
      },
      {
        "role": "assistant",
        "content": "4",
        "expected": true
      }
    ]
  },
  {
    "eval_id": "test_002",
    "turns": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      },
      {
        "role": "assistant",
        "content": "Paris",
        "expected": true
      }
    ]
  }
]
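A quick structural check can catch malformed test files before they reach the CLI. The sketch below is an assumption about what a well-formed file looks like, based only on the examples in this guide (each case has `eval_id` and `turns`, each turn has `role` and `content`); `check_cases` and `check_file` are hypothetical helpers, and the rules Judge LLM itself enforces may differ.

```python
import json

# Roles that appear in this guide's examples; the real set may be larger.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_cases(cases):
    """Return a list of human-readable problems found in the test cases."""
    if not isinstance(cases, list):
        return ["top-level value must be a list of test cases"]
    problems = []
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {index}: must be an object")
            continue
        label = case.get("eval_id", f"case {index}")
        if "eval_id" not in case:
            problems.append(f"{label}: missing eval_id")
        turns = case.get("turns")
        if not isinstance(turns, list) or not turns:
            problems.append(f"{label}: turns must be a non-empty list")
            continue
        for position, turn in enumerate(turns):
            if not isinstance(turn, dict) or turn.get("role") not in ALLOWED_ROLES:
                problems.append(f"{label}: turn {position} has an unknown role")
            elif "content" not in turn:
                problems.append(f"{label}: turn {position} is missing content")
    return problems


def check_file(path):
    """Load a JSON test file and return the problems found in it."""
    with open(path, encoding="utf-8") as handle:
        return check_cases(json.load(handle))
```

Calling `check_file("tests.json")` on the file above should return an empty list.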
3. Create Configuration
Create a test.yaml configuration file:
dataset:
  loader: local_file
  paths:
    - ./tests.json
providers:
  - type: gemini
    agent_id: my_agent
evaluators:
  - type: response_evaluator
reporters:
  - type: console
4. Run Evaluation
judge-llm run --config test.yaml
Output:
Starting evaluation...
Evaluation Progress:
test_001: ✓ PASSED (cost: $0.0012, time: 1.2s)
test_002: ✓ PASSED (cost: $0.0015, time: 1.5s)
Summary:
Total Tests: 2
Passed: 2
Failed: 0
Success Rate: 100.0%
Total Cost: $0.0027
Total Time: 2.7s
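If you want to trigger the same command from a script (for example in a CI job), one option is to wrap the CLI with `subprocess`. This is a sketch under assumptions: `build_command` and `run_evaluation` are hypothetical helpers, and this guide does not state which exit code the CLI returns when tests fail, so the code only reports the exit code rather than interpreting it.

```python
import subprocess


def build_command(config_path):
    """Build the argument list for the CLI invocation shown in step 4."""
    return ["judge-llm", "run", "--config", str(config_path)]


def run_evaluation(config_path):
    """Run the CLI and return (exit_code, captured_stdout)."""
    completed = subprocess.run(
        build_command(config_path),
        capture_output=True,
        text=True,
        check=False,  # inspect the exit code ourselves instead of raising
    )
    return completed.returncode, completed.stdout
```

A call such as `code, output = run_evaluation("test.yaml")` requires `judge-llm` to be installed and on your PATH.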
Using Python API
You can also run evaluations programmatically:
from judge_llm import evaluate

report = evaluate(
    dataset={
        "loader": "local_file",
        "paths": ["./tests.json"]
    },
    providers=[
        {"type": "gemini", "agent_id": "my_agent"}
    ],
    evaluators=[
        {"type": "response_evaluator"}
    ],
    reporters=[
        {"type": "console"}
    ]
)

# Access results
print(f"Success Rate: {report.success_rate * 100:.1f}%")
print(f"Total Cost: ${report.total_cost:.4f}")
print(f"Tests Passed: {report.summary['successful_executions']}")
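Because `report.success_rate` is a fraction between 0 and 1 (the example multiplies it by 100), it is easy to gate a script on a minimum pass rate. The sketch below uses a stand-in object so it runs without Judge LLM installed; `meets_threshold` and the 0.95 default are illustrative assumptions, not part of the library.

```python
from types import SimpleNamespace


def meets_threshold(report, minimum=0.95):
    """Return True when report.success_rate (a 0-1 fraction) reaches the minimum."""
    return report.success_rate >= minimum


# Stand-in for illustration; in practice pass the report returned by evaluate().
fake_report = SimpleNamespace(success_rate=1.0, total_cost=0.0027)
print(meets_threshold(fake_report))        # True
print(meets_threshold(fake_report, 1.01))  # False
```

In a CI script you might follow this with `sys.exit(0 if meets_threshold(report) else 1)`.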
Common Usage Patterns
Single Provider Evaluation
Test one model:
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - type: gemini
    agent_id: gemini_agent
evaluators:
  - type: response_evaluator
reporters:
  - type: console
judge-llm run --config test.yaml
Multiple Providers (A/B Testing)
Compare multiple models:
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - type: gemini
    agent_id: gemini
  - type: openai
    agent_id: openai
evaluators:
  - type: response_evaluator
reporters:
  - type: console
  - type: html
    output_path: ./comparison.html
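The same comparison can be assembled for the Python API. The sketch below builds the keyword arguments for `evaluate()` with one provider entry per model; `ab_config` is a hypothetical helper, and it assumes the Python API accepts the same keys shown in the "Using Python API" example.

```python
def ab_config(provider_types, tests_path="./tests.json"):
    """Build keyword arguments for evaluate() with one provider per type."""
    return {
        "dataset": {"loader": "local_file", "paths": [tests_path]},
        # Reuse the provider type as the agent_id, as the YAML above does.
        "providers": [{"type": name, "agent_id": name} for name in provider_types],
        "evaluators": [{"type": "response_evaluator"}],
        "reporters": [{"type": "console"}],
    }


config = ab_config(["gemini", "openai"])
print(config["providers"])
# [{'type': 'gemini', 'agent_id': 'gemini'}, {'type': 'openai', 'agent_id': 'openai'}]
```

With Judge LLM installed, `evaluate(**ab_config(["gemini", "openai"]))` would then run both providers against the same test file.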
Multiple Evaluators
Use multiple evaluation criteria:
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - type: gemini
    agent_id: test_agent
evaluators:
  - type: response_evaluator
  - type: cost_evaluator
    max_cost: 0.01
  - type: latency_evaluator
    max_latency: 3.0
reporters:
  - type: console
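To make the two thresholds concrete, here is a small sketch of the pass/fail rule as this guide describes it: a test fails the cost check when its cost exceeds `max_cost`. Treating `max_latency` as seconds and as a strict "greater than" comparison is an assumption, and `exceeded_limits` is a hypothetical function, not the evaluators' actual implementation.

```python
def exceeded_limits(cost, latency, max_cost=0.01, max_latency=3.0):
    """Return the names of the limits that a single test run exceeded."""
    exceeded = []
    if cost > max_cost:  # strictly greater: a cost equal to the limit passes
        exceeded.append("max_cost")
    if latency > max_latency:  # assumed to be measured in seconds
        exceeded.append("max_latency")
    return exceeded


print(exceeded_limits(cost=0.0012, latency=1.2))  # []
print(exceeded_limits(cost=0.0200, latency=3.5))  # ['max_cost', 'max_latency']
```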
Multiple Output Formats
Generate multiple report types:
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - type: gemini
    agent_id: test_agent
evaluators:
  - type: response_evaluator
reporters:
  - type: console
  - type: json
    output_path: ./results.json
  - type: html
    output_path: ./report.html
  - type: database
    db_path: ./results.db
Test Case Format
Single Turn
{
  "eval_id": "simple_test",
  "turns": [
    {
      "role": "user",
      "content": "What is 2+2?"
    },
    {
      "role": "assistant",
      "content": "4",
      "expected": true
    }
  ]
}
Multi-Turn Conversation
{
  "eval_id": "conversation_test",
  "turns": [
    {
      "role": "user",
      "content": "What is 2+2?"
    },
    {
      "role": "assistant",
      "content": "4",
      "expected": true
    },
    {
      "role": "user",
      "content": "And what is 4+4?"
    },
    {
      "role": "assistant",
      "content": "8",
      "expected": true
    }
  ]
}
With System Prompt
{
  "eval_id": "system_prompt_test",
  "turns": [
    {
      "role": "system",
      "content": "You are a helpful math tutor."
    },
    {
      "role": "user",
      "content": "What is 2+2?"
    },
    {
      "role": "assistant",
      "content": "4",
      "expected": true
    }
  ]
}
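All three shapes above follow one pattern: an optional system turn, then alternating user and assistant turns. If you generate test cases from code, a small builder keeps them consistent. This is a sketch; `make_case` is a hypothetical helper, and it mirrors the examples by setting `"expected": true` on every assistant turn.

```python
import json


def make_case(eval_id, exchanges, system=None):
    """Build one test case from (user_message, expected_reply) pairs."""
    turns = []
    if system is not None:
        turns.append({"role": "system", "content": system})
    for question, answer in exchanges:
        turns.append({"role": "user", "content": question})
        turns.append({"role": "assistant", "content": answer, "expected": True})
    return {"eval_id": eval_id, "turns": turns}


cases = [
    make_case("simple_test", [("What is 2+2?", "4")]),
    make_case("conversation_test", [("What is 2+2?", "4"), ("And what is 4+4?", "8")]),
    make_case("system_prompt_test", [("What is 2+2?", "4")],
              system="You are a helpful math tutor."),
]
print(json.dumps(cases, indent=2))
```

To produce a test file, write the list with `json.dump(cases, handle, indent=2)` instead of printing it.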
Best Practices
1. Start Simple
Begin with a few test cases and one provider:
dataset:
  loader: local_file
  paths: [./tests.json] # Start with 5-10 tests
providers:
  - type: gemini
    agent_id: test
evaluators:
  - type: response_evaluator
reporters:
  - type: console
2. Use Environment Variables
Never commit API keys:
# Good
providers:
  - type: gemini
    api_key: ${GEMINI_API_KEY}
# Bad
providers:
  - type: gemini
    api_key: "AIzaSy..." # Don't do this!
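For intuition, a `${NAME}` placeholder like the one in the "Good" example can be resolved from the environment with the standard library's `string.Template`. This sketch only illustrates the general technique; how Judge LLM itself performs the substitution is not described here, and `expand_placeholders` is a hypothetical name.

```python
import os
from string import Template


def expand_placeholders(text, env=None):
    """Replace ${NAME} placeholders with values from the environment.

    Raises KeyError when a referenced variable is not set.
    """
    env = os.environ if env is None else env
    return Template(text).substitute(env)


print(expand_placeholders("api_key: ${GEMINI_API_KEY}", {"GEMINI_API_KEY": "demo-key"}))
# prints: api_key: demo-key
```

Failing loudly on a missing variable is a deliberate choice in this sketch: an empty key would otherwise surface later as a less obvious authentication error.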
3. Iterate Incrementally
- Start with basic response evaluation
- Add cost/latency checks
- Add custom evaluators
- Expand test coverage
4. Version Control Your Tests
git add tests.json test.yaml
git commit -m "Add test cases for feature X"
5. Monitor Costs
Add a cost evaluator to prevent surprises:
evaluators:
  - type: response_evaluator
  - type: cost_evaluator
    max_cost: 0.01 # Fail if cost > $0.01 per test
Common Mistakes
Missing API Keys
Error: API key not found for provider: gemini
Solution: Set the provider's environment variable (for example, GEMINI_API_KEY) or add it to a .env file in your project root.
Invalid Test Format
Error: Invalid test case format
Solution: Ensure each test case has both an eval_id field and a turns field.
File Not Found
Error: Test file not found: ./tests.json
Solution: Check that the file path is correct relative to the config file.
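When debugging a path problem, it can help to print the absolute path that a relative entry would resolve to. The sketch below resolves a dataset path against the directory containing the config file, which is how the solution above describes it; `resolve_from_config` is a hypothetical helper, and the tool's exact resolution rules are an assumption.

```python
from pathlib import Path


def resolve_from_config(config_path, relative_path):
    """Resolve a dataset path against the directory containing the config file."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        return candidate
    return (Path(config_path).parent / candidate).resolve()


tests_file = resolve_from_config("test.yaml", "./tests.json")
print(tests_file, "exists" if tests_file.exists() else "is missing")
```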
Wrong Provider Type
Error: Unknown provider type: gpt
Solution: Use a supported provider name: gemini, openai, or anthropic.
Getting Help
List Available Components
# List providers
judge-llm list providers
# List evaluators
judge-llm list evaluators
# List reporters
judge-llm list reporters
Validate Configuration
judge-llm validate --config test.yaml
View Documentation
judge-llm --help
judge-llm run --help
Next Steps
- Configuration Guide - Learn all configuration options
- CLI Reference - Complete CLI documentation
- Python API - Programmatic usage
- Examples - Working examples for common scenarios
- Custom Evaluators - Build custom evaluators
- Custom Reporters - Build custom reporters