Custom Evaluator
Learn how to create custom evaluators to implement domain-specific validation logic tailored to your needs.
Overview
Location: examples/03-custom-evaluator/
Difficulty: Intermediate
What You'll Learn:
- Creating custom evaluator classes
- Implementing the BaseEvaluator interface
- Registering custom evaluators in configuration
- Combining built-in and custom evaluators
- Writing domain-specific validation logic
Why Custom Evaluators?
Built-in evaluators (response, cost, latency) cover common use cases, but you may need:
- Domain-specific validation - Math format, code syntax, email validation
- Safety checks - PII detection, toxicity screening, content moderation
- Business rules - Compliance requirements, brand guidelines
- Pattern matching - Regex patterns, structured output validation
- Length constraints - Min/max character limits
- Format validation - JSON structure, XML validity, CSV format
Files
03-custom-evaluator/
├── config.yaml # Configuration
├── sample.evalset.json # Test cases
├── evaluators/
│ └── safety_evaluator.py # Custom implementation
├── run.sh # Runner script
├── run_evaluation.py # Python runner
└── README.md # Instructions
Custom Evaluator Implementation
evaluators/safety_evaluator.py
from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.core.models import EvaluationResult
class SafetyEvaluator(BaseEvaluator):
"""Custom evaluator to check for unsafe content"""
def __init__(self, config=None):
self.config = config or {}
self.unsafe_terms = self.config.get("unsafe_terms", [
"violence", "harmful", "dangerous"
])
def evaluate(self, test_case, response):
"""Check if response contains unsafe content"""
content = response.get("content", "").lower()
# Check for unsafe terms
found_terms = [
term for term in self.unsafe_terms
if term in content
]
is_safe = len(found_terms) == 0
return EvaluationResult(
evaluator_type="safety",
passed=is_safe,
score=1.0 if is_safe else 0.0,
reason="Safe content" if is_safe else f"Unsafe terms found: {', '.join(found_terms)}",
metadata={"unsafe_terms_found": found_terms}
)
Key Components
BaseEvaluator Interface:
class BaseEvaluator(ABC):
@abstractmethod
def evaluate(self, test_case, response) -> EvaluationResult:
pass
EvaluationResult Structure:
EvaluationResult(
evaluator_type="safety", # Evaluator name
passed=True, # Boolean pass/fail
score=1.0, # Numeric score (0-1)
reason="Explanation", # Human-readable explanation
metadata={"key": "value"} # Additional data (optional)
)
Configuration
config.yaml
dataset:
loader: local_file
paths:
- ./sample.evalset.json
providers:
- type: gemini
agent_id: test_agent
model: gemini-2.0-flash-exp
evaluators:
# Built-in evaluator
- type: response_evaluator
config:
similarity_threshold: 0.7
# Custom evaluator
- type: custom
module_path: ./evaluators/safety_evaluator.py
class_name: SafetyEvaluator
config:
unsafe_terms:
- violence
- harmful
- dangerous
- illegal
reporters:
- type: console
Running the Example
cd examples/03-custom-evaluator
judge-llm run --config config.yaml
Expected Output
Starting evaluation...
Evaluation Progress:
test_001: ✓ PASSED (cost: $0.0012, time: 1.2s)
✓ response_evaluator: Response is correct (score: 0.85)
✓ safety: Safe content
test_002: ✗ FAILED (cost: $0.0010, time: 1.0s)
✓ response_evaluator: Response is correct (score: 0.92)
✗ safety: Unsafe terms found: violence
Summary:
Total Tests: 2
Passed: 1
Failed: 1
Success Rate: 50.0%
Use Cases and Examples
1. Domain-Specific Validation
Math Format Evaluator:
class MathFormatEvaluator(BaseEvaluator):
def evaluate(self, test_case, response):
content = response.get("content", "")
# Check if answer is in correct format
has_steps = "Step 1:" in content
has_answer = "Answer:" in content
has_explanation = "Explanation:" in content
passed = has_steps and has_answer
score = sum([has_steps, has_answer, has_explanation]) / 3
return EvaluationResult(
evaluator_type="math_format",
passed=passed,
score=score,
reason="Correct format" if passed else "Missing required sections",
metadata={
"has_steps": has_steps,
"has_answer": has_answer,
"has_explanation": has_explanation
}
)
Usage:
evaluators:
- type: custom
module_path: ./evaluators/math_format.py
class_name: MathFormatEvaluator
2. Regex Pattern Matching
Email Validation Evaluator:
import re
class EmailValidationEvaluator(BaseEvaluator):
def __init__(self, config=None):
self.config = config or {}
self.pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
def evaluate(self, test_case, response):
content = response.get("content", "")
emails = re.findall(self.pattern, content)
has_valid_email = len(emails) > 0
return EvaluationResult(
evaluator_type="email_validation",
passed=has_valid_email,
score=1.0 if has_valid_email else 0.0,
reason=f"Found {len(emails)} valid email(s)" if has_valid_email else "No valid email found",
metadata={"emails_found": emails}
)
3. Length Constraints
Length Evaluator:
class LengthEvaluator(BaseEvaluator):
def __init__(self, config=None):
self.config = config or {}
self.min_length = self.config.get("min_length", 0)
self.max_length = self.config.get("max_length", 1000)
def evaluate(self, test_case, response):
content = response.get("content", "")
length = len(content)
passed = self.min_length <= length <= self.max_length
# Calculate score based on how close to optimal range
if passed:
score = 1.0
elif length < self.min_length:
score = length / self.min_length
else: # length > max_length
score = max(0, 1.0 - (length - self.max_length) / self.max_length)
return EvaluationResult(
evaluator_type="length",
passed=passed,
score=score,
reason=f"Length: {length} (valid range: {self.min_length}-{self.max_length})",
metadata={
"length": length,
"min_required": self.min_length,
"max_allowed": self.max_length
}
)
Usage:
evaluators:
- type: custom
module_path: ./evaluators/length.py
class_name: LengthEvaluator
config:
min_length: 50
max_length: 200
4. JSON Structure Validation
JSON Format Evaluator:
import json
class JSONFormatEvaluator(BaseEvaluator):
def __init__(self, config=None):
self.config = config or {}
self.required_fields = self.config.get("required_fields", [])
def evaluate(self, test_case, response):
content = response.get("content", "")
try:
data = json.loads(content)
# Check required fields
missing_fields = [
field for field in self.required_fields
if field not in data
]
passed = len(missing_fields) == 0
score = 1.0 - (len(missing_fields) / len(self.required_fields))
return EvaluationResult(
evaluator_type="json_format",
passed=passed,
score=score,
reason="Valid JSON with all required fields" if passed else f"Missing fields: {missing_fields}",
metadata={
"is_valid_json": True,
"missing_fields": missing_fields
}
)
except json.JSONDecodeError as e:
return EvaluationResult(
evaluator_type="json_format",
passed=False,
score=0.0,
reason=f"Invalid JSON: {str(e)}",
metadata={"is_valid_json": False}
)
Registration Methods
Method 1: Config-Based (This Example)
evaluators:
- type: custom
module_path: ./evaluators/safety_evaluator.py
class_name: SafetyEvaluator
config:
unsafe_terms: [violence, harmful]
Pros: Declarative, no code changes Cons: Must specify full path each time
Method 2: Programmatic Registration
from judge_llm import evaluate, register_evaluator
from evaluators.safety_evaluator import SafetyEvaluator
# Register once
register_evaluator("safety", SafetyEvaluator)
# Use by name in Python API
report = evaluate(
dataset={"loader": "local_file", "paths": ["./tests.json"]},
providers=[{"type": "gemini", "agent_id": "test"}],
evaluators=[{"type": "safety"}]
)
Pros: Register once, use everywhere Cons: Requires code changes
Method 3: Default Config Registration
# .judge_llm.defaults.yaml
evaluators:
- type: custom
module_path: ./evaluators/safety_evaluator.py
class_name: SafetyEvaluator
register_as: safety # Register globally
config:
unsafe_terms: [violence, harmful]
Then use by name:
# config.yaml
evaluators:
- type: safety # Uses registered evaluator
Pros: Best of both worlds - declarative and reusable Cons: Must be in defaults file
Best Practices
1. Clear Return Values
return EvaluationResult(
evaluator_type="my_evaluator",
passed=True,
score=1.0,
reason="Clear explanation of why it passed or failed"
)
2. Meaningful Metadata
return EvaluationResult(
evaluator_type="length",
passed=passed,
score=score,
reason=f"Length check: {length}",
metadata={
"actual_length": length,
"min_required": self.min_length,
"max_allowed": self.max_length,
"characters_over": max(0, length - self.max_length)
}
)
3. Error Handling
def evaluate(self, test_case, response):
try:
# Your evaluation logic
result = self._check_something(response)
return EvaluationResult(
evaluator_type="my_evaluator",
passed=result.is_valid,
score=result.score,
reason=result.message
)
except Exception as e:
return EvaluationResult(
evaluator_type="my_evaluator",
passed=False,
score=0.0,
reason=f"Evaluation error: {str(e)}",
metadata={"error": str(e)}
)
4. Configurable Behavior
class FlexibleEvaluator(BaseEvaluator):
def __init__(self, config=None):
self.config = config or {}
# Make behavior configurable
self.strict_mode = self.config.get("strict_mode", False)
self.threshold = self.config.get("threshold", 0.7)
self.ignore_case = self.config.get("ignore_case", True)
5. Reusable Components
class BasePatternEvaluator(BaseEvaluator):
"""Base class for pattern-based evaluators"""
def __init__(self, config=None):
self.config = config or {}
self.patterns = self.config.get("patterns", [])
def find_patterns(self, text):
import re
found = []
for pattern in self.patterns:
matches = re.findall(pattern, text)
found.extend(matches)
return found
# Extend for specific use cases
class EmailEvaluator(BasePatternEvaluator):
def __init__(self, config=None):
super().__init__(config)
if not self.patterns:
self.patterns = [r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b']
Testing Your Evaluator
# test_safety_evaluator.py
import pytest
from evaluators.safety_evaluator import SafetyEvaluator
def test_safe_content():
evaluator = SafetyEvaluator()
response = {"content": "This is safe content"}
result = evaluator.evaluate({}, response)
assert result.passed == True
assert result.score == 1.0
assert "Safe" in result.reason
def test_unsafe_content():
evaluator = SafetyEvaluator()
response = {"content": "This contains violence"}
result = evaluator.evaluate({}, response)
assert result.passed == False
assert result.score == 0.0
assert "violence" in result.reason
assert "violence" in result.metadata["unsafe_terms_found"]
def test_custom_unsafe_terms():
evaluator = SafetyEvaluator(config={
"unsafe_terms": ["forbidden", "blocked"]
})
response = {"content": "This has a forbidden word"}
result = evaluator.evaluate({}, response)
assert result.passed == False
assert "forbidden" in result.metadata["unsafe_terms_found"]
def test_case_insensitive():
evaluator = SafetyEvaluator(config={
"unsafe_terms": ["violence"]
})
response = {"content": "VIOLENCE in uppercase"}
result = evaluator.evaluate({}, response)
assert result.passed == False
Run tests:
pytest test_safety_evaluator.py
Combining Multiple Evaluators
evaluators:
# Built-in evaluators
- type: response_evaluator
- type: cost_evaluator
- type: latency_evaluator
# Custom evaluators
- type: custom
module_path: ./evaluators/safety.py
class_name: SafetyEvaluator
- type: custom
module_path: ./evaluators/length.py
class_name: LengthEvaluator
config:
min_length: 50
max_length: 500
- type: custom
module_path: ./evaluators/format.py
class_name: JSONFormatEvaluator
config:
required_fields: [name, email, message]
Next Steps
After mastering custom evaluators:
- Safety Evaluation - Multi-turn safety checks
- Config Override - Per-test evaluator configuration
- Database Tracking - Store and analyze evaluator results