Evaluator Config Override
Master per-test-case configuration overrides to fine-tune evaluation criteria for different scenarios.
Overview
Location: examples/05-evaluator-config-override/
Difficulty: Intermediate
What You'll Learn:
- Two-level configuration system (global + per-test)
- Overriding evaluator settings per test case
- Configuration merging behavior
- Use cases for different override strategies
The Problem
Different tests have different requirements:
- Exact commands: "Say 'Hello World'" needs 100% match
- Creative content: Essay writing allows variation
- Simple queries: Should be fast (<5s)
- Complex analysis: Needs more time and cost budget
One global configuration can't handle all scenarios!
Solution: Two-Level Configuration
- Global Config (config.yaml): defaults for all tests
- Per-Test Config (evalset.json → evaluator_config): overrides for specific tests
Per-test settings override global settings with deep merging.
Files
05-evaluator-config-override/
├── config.yaml # Global defaults
├── test_cases.evalset.json # Tests with overrides
├── run.sh # Runner script
├── run_evaluation.py # Python runner
└── README.md # Instructions
Global Configuration
config.yaml sets defaults for all tests:
evaluators:
- type: response_evaluator
enabled: true
config:
similarity_threshold: 0.6 # Default: 60% similarity
match_type: semantic # Default: semantic matching
case_sensitive: false # Default: ignore case
normalize_whitespace: true
- type: latency_evaluator
enabled: true
config:
max_latency_seconds: 30 # Default: max 30 seconds
warn_threshold_seconds: 10
- type: cost_evaluator
enabled: true
config:
max_cost_per_case: 0.10 # Default: max $0.10
Per-Test Overrides
Test 1: Uses Global Defaults
{
"eval_id": "test_001_uses_defaults",
"conversation": [...],
// NO evaluator_config field
// Uses all global defaults
}
Effective Config:
- Threshold: 0.6 (from global)
- Match type: semantic (from global)
- Max latency: 30s (from global)
- Max cost: $0.10 (from global)
Test 2: Strict Exact Match
{
"eval_id": "test_002_strict_exact_match",
"conversation": [...],
"evaluator_config": {
"ResponseEvaluator": {
"match_type": "exact",
"similarity_threshold": 1.0,
"case_sensitive": true,
"normalize_whitespace": false
}
}
}
Effective Config:
- ✅ Threshold: 1.0 (overridden - requires 100% match)
- ✅ Match type: exact (overridden)
- ✅ Case sensitive: true (overridden)
- ❌ Max latency: 30s (inherited from global)
Use Case: Commands that must match exactly, e.g. "Say 'Hello World'"
Test 3: Lenient Creative Content
{
"eval_id": "test_003_lenient_creative",
"conversation": [...],
"evaluator_config": {
"ResponseEvaluator": {
"match_type": "recall",
"similarity_threshold": 0.3
}
}
}
Effective Config:
- ✅ Threshold: 0.3 (overridden - only 30% match needed)
- ✅ Match type: recall (overridden - doesn't penalize extra content)
- ❌ Case sensitive: false (inherited)
Use Case: Creative writing, essays, brainstorming
Test 4: High Precision Factual
{
"eval_id": "test_004_high_precision_factual",
"conversation": [...],
"evaluator_config": {
"ResponseEvaluator": {
"match_type": "rouge",
"similarity_threshold": 0.85
}
}
}
Effective Config:
- ✅ Threshold: 0.85 (overridden - high precision)
- ✅ Match type: rouge (overridden - ROUGE-1 metric)
Use Case: Factual Q&A, knowledge retrieval
Test 5: Override Multiple Evaluators
{
"eval_id": "test_005_fast_latency_required",
"conversation": [...],
"evaluator_config": {
"LatencyEvaluator": {
"max_latency_seconds": 5,
"warn_threshold_seconds": 2
},
"ResponseEvaluator": {
"match_type": "exact",
"similarity_threshold": 1.0
}
}
}
Effective Config:
- ✅ Max latency: 5s (overridden from 30s)
- ✅ Warn threshold: 2s (overridden from 10s)
- ✅ Match type: exact (overridden)
- ✅ Threshold: 1.0 (overridden)
Use Case: Simple queries requiring fast, exact responses
Test 6: Complex Operation
{
"eval_id": "test_006_expensive_operation_allowed",
"conversation": [...],
"evaluator_config": {
"CostEvaluator": {
"max_cost_per_case": 0.50
},
"LatencyEvaluator": {
"max_latency_seconds": 120,
"warn_threshold_seconds": 60
}
}
}
Effective Config:
- ✅ Max cost: $0.50 (overridden from $0.10)
- ✅ Max latency: 120s (overridden from 30s)
Use Case: Complex analysis, research tasks, code generation
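Tests 5 and 6 show that overrides are applied per evaluator: each key in evaluator_config names one evaluator, and only that evaluator's settings are merged; evaluators not named keep their global config untouched. A minimal sketch of that outer loop (the real runner's internals may differ):

```python
def apply_overrides(global_configs, evaluator_config):
    """Merge per-test overrides into each evaluator's global config.

    global_configs:   {"LatencyEvaluator": {...}, ...} built from config.yaml
    evaluator_config: the per-test "evaluator_config" block (may be None)
    """
    # Copy so the global defaults are never mutated
    effective = {name: dict(cfg) for name, cfg in global_configs.items()}
    for name, overrides in (evaluator_config or {}).items():
        effective.setdefault(name, {}).update(overrides)  # per-test wins
    return effective

# Test 6: raise cost and latency budgets, leave ResponseEvaluator untouched
global_configs = {
    "ResponseEvaluator": {"similarity_threshold": 0.6, "match_type": "semantic"},
    "LatencyEvaluator": {"max_latency_seconds": 30, "warn_threshold_seconds": 10},
    "CostEvaluator": {"max_cost_per_case": 0.10},
}
overrides = {
    "CostEvaluator": {"max_cost_per_case": 0.50},
    "LatencyEvaluator": {"max_latency_seconds": 120, "warn_threshold_seconds": 60},
}
effective = apply_overrides(global_configs, overrides)
```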
How Configuration Merging Works
Merge Algorithm
def get_config(self, eval_config: Optional[Dict] = None) -> Dict:
    if eval_config is None:
        return dict(self.config)  # Use global only
    # Per-test values override global ones; untouched global keys survive
    merged = dict(self.config)
    for key, value in eval_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = {**merged[key], **value}  # Merge nested dicts
        else:
            merged[key] = value  # Per-test wins
    return merged
Merge Example
Global:
{
"similarity_threshold": 0.6,
"match_type": "semantic",
"case_sensitive": false,
"normalize_whitespace": true
}
Per-Test:
{
"match_type": "exact",
"similarity_threshold": 1.0
}
Merged Result:
{
"similarity_threshold": 1.0, # ✅ Overridden
"match_type": "exact", # ✅ Overridden
"case_sensitive": false, # ❌ Inherited from global
"normalize_whitespace": true # ❌ Inherited from global
}
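The same merge can be sketched in Python with a plain dict merge, since an individual evaluator's config is flat (keys from the per-test dict win on conflict):

```python
# Global defaults for ResponseEvaluator, as loaded from config.yaml
global_cfg = {
    "similarity_threshold": 0.6,
    "match_type": "semantic",
    "case_sensitive": False,
    "normalize_whitespace": True,
}

# Per-test override block for this evaluator
per_test = {"match_type": "exact", "similarity_threshold": 1.0}

# Later keys win: per-test values replace global ones,
# keys absent from per_test are inherited from global_cfg
merged = {**global_cfg, **per_test}
```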
Important: Evaluator Name vs Type
In global config (YAML):
evaluators:
- type: response_evaluator # Use type name
In per-test override (JSON):
{
"evaluator_config": {
"ResponseEvaluator": { // Use class name!
"similarity_threshold": 0.9
}
}
}
Evaluator Name Mapping:
- type: response_evaluator → ResponseEvaluator
- type: cost_evaluator → CostEvaluator
- type: latency_evaluator → LatencyEvaluator
- type: trajectory_evaluator → TrajectoryEvaluator
- Custom evaluators → use your class name
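For the built-in evaluators, the type name converts to the class name by CamelCasing the snake_case segments. A sketch of that conversion (the framework may instead keep an explicit registry, so treat this as illustrative):

```python
def type_to_class_name(type_name: str) -> str:
    """Convert a YAML type name (snake_case) to the class name
    expected in a per-test evaluator_config block (CamelCase)."""
    return "".join(part.capitalize() for part in type_name.split("_"))
```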
Running the Example
cd examples/05-evaluator-config-override
judge-llm run --config config.yaml
Expected Output
Starting evaluation...
test_001_uses_defaults:
ResponseEvaluator: threshold=0.6, match_type=semantic (global defaults)
✓ PASSED
test_002_strict_exact_match:
ResponseEvaluator: threshold=1.0, match_type=exact (overridden)
✓ PASSED
test_003_lenient_creative:
ResponseEvaluator: threshold=0.3, match_type=recall (overridden)
✓ PASSED
test_004_high_precision_factual:
ResponseEvaluator: threshold=0.85, match_type=rouge (overridden)
✓ PASSED
test_005_fast_latency_required:
ResponseEvaluator: threshold=1.0, match_type=exact (overridden)
LatencyEvaluator: max=5s (overridden)
✓ PASSED
test_006_expensive_operation_allowed:
CostEvaluator: max=$0.50 (overridden)
LatencyEvaluator: max=120s (overridden)
✓ PASSED
Summary:
Total: 6
Passed: 6
Failed: 0
Success Rate: 100%
Use Cases by Test Type
| Test Type | Override Strategy | Configuration |
|---|---|---|
| Exact commands | 100% match required | match_type: exact, threshold: 1.0 |
| Creative content | Allow variation | match_type: recall, threshold: 0.3 |
| Factual Q&A | High precision | match_type: rouge, threshold: 0.85 |
| Simple queries | Fast response | Low latency limits |
| Complex tasks | More resources | Higher cost/latency limits |
| Safety-critical | Specific checks | Custom evaluator config |
Common Override Patterns
Pattern 1: Strict Validation
{
"evaluator_config": {
"ResponseEvaluator": {
"match_type": "exact",
"similarity_threshold": 1.0,
"case_sensitive": true,
"normalize_whitespace": false
}
}
}
For: Commands, code snippets, structured output
Pattern 2: Lenient Validation
{
"evaluator_config": {
"ResponseEvaluator": {
"match_type": "recall",
"similarity_threshold": 0.3,
"case_sensitive": false
}
}
}
For: Essays, creative writing, brainstorming
Pattern 3: Performance Critical
{
"evaluator_config": {
"LatencyEvaluator": {
"max_latency_seconds": 3,
"warn_threshold_seconds": 1
},
"CostEvaluator": {
"max_cost_per_case": 0.01
}
}
}
For: High-volume, real-time applications
Pattern 4: Complex Analysis
{
"evaluator_config": {
"LatencyEvaluator": {
"max_latency_seconds": 180
},
"CostEvaluator": {
"max_cost_per_case": 1.00
}
}
}
For: Research, code generation, complex reasoning
Pattern 5: Safety Critical
{
"evaluator_config": {
"SafetyEvaluator": {
"severity_threshold": "low",
"allowed_safety_issues": 0,
"check_pii": true,
"check_toxicity": true
}
}
}
For: User-facing, regulated content
Best Practices
1. Set Reasonable Global Defaults
# Global config should work for MOST tests
evaluators:
- type: response_evaluator
config:
similarity_threshold: 0.7 # Balanced default
match_type: semantic # Flexible matching
2. Override Only When Necessary
// Good: Only override what's needed
{
"evaluator_config": {
"ResponseEvaluator": {
"similarity_threshold": 0.9
}
}
}
// Bad: Duplicating global settings
{
"evaluator_config": {
"ResponseEvaluator": {
"similarity_threshold": 0.9,
"match_type": "semantic", // Already in global
"case_sensitive": false, // Already in global
"normalize_whitespace": true // Already in global
}
}
}
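One way to keep overrides minimal is to strip any key that merely repeats the global default before committing a test case. A hypothetical helper (not part of the framework) that does this:

```python
def minimal_override(global_cfg: dict, override: dict) -> dict:
    """Drop override keys whose values equal the global default,
    leaving only the settings that actually change behavior."""
    return {k: v for k, v in override.items() if global_cfg.get(k) != v}

global_cfg = {"similarity_threshold": 0.6, "match_type": "semantic",
              "case_sensitive": False, "normalize_whitespace": True}

# A verbose override that duplicates three global settings
verbose = {"similarity_threshold": 0.9, "match_type": "semantic",
           "case_sensitive": False, "normalize_whitespace": True}

trimmed = minimal_override(global_cfg, verbose)
```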
3. Document Why You Override
{
"eval_id": "strict_command_test",
"description": "Testing exact command replication - requires 100% match",
"evaluator_config": {
"ResponseEvaluator": {
"match_type": "exact",
"similarity_threshold": 1.0
}
}
}
4. Group Similar Tests
If many tests share overrides, create separate evalset files:
tests/
├── strict_tests.evalset.json # All use exact matching
├── creative_tests.evalset.json # All use lenient matching
└── performance_tests.evalset.json # All have strict latency
5. Verify Overrides Are Applied
# Debug mode shows effective configuration
judge-llm run --config test.yaml --verbose --show-config
Debugging
Show Effective Configuration
judge-llm run --config config.yaml --show-config
Shows merged configuration before running.
Trace Configuration Sources
judge-llm run --config config.yaml --trace-config
Shows where each value comes from (global vs per-test).
Common Issues
Issue: Override not applied
Solution: Check evaluator name (use class name, not type):
{
"evaluator_config": {
"ResponseEvaluator": { // ✅ Correct
...
}
}
}
{
"evaluator_config": {
"response_evaluator": { // ❌ Wrong
...
}
}
}
Next Steps
After mastering config overrides:
- Safety Evaluation - Complex per-test safety settings
- Database Tracking - Track configuration effectiveness
- Default Config - Combine defaults with overrides