
Custom Evaluator

Learn how to create custom evaluators to implement domain-specific validation logic tailored to your needs.

Overview

Location: examples/03-custom-evaluator/

Difficulty: Intermediate

What You'll Learn:

  • Creating custom evaluator classes
  • Implementing the BaseEvaluator interface
  • Registering custom evaluators in configuration
  • Combining built-in and custom evaluators
  • Writing domain-specific validation logic

Why Custom Evaluators?

Built-in evaluators (response, cost, latency) cover common use cases, but you may need:

  • Domain-specific validation - Math format, code syntax, email validation
  • Safety checks - PII detection, toxicity screening, content moderation
  • Business rules - Compliance requirements, brand guidelines
  • Pattern matching - Regex patterns, structured output validation
  • Length constraints - Min/max character limits
  • Format validation - JSON structure, XML validity, CSV format

Files

03-custom-evaluator/
├── config.yaml              # Configuration
├── sample.evalset.json      # Test cases
├── evaluators/
│   └── safety_evaluator.py  # Custom implementation
├── run.sh                   # Runner script
├── run_evaluation.py        # Python runner
└── README.md                # Instructions

Custom Evaluator Implementation

evaluators/safety_evaluator.py

from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.core.models import EvaluationResult

class SafetyEvaluator(BaseEvaluator):
    """Custom evaluator to check for unsafe content"""

    def __init__(self, config=None):
        self.config = config or {}
        self.unsafe_terms = self.config.get("unsafe_terms", [
            "violence", "harmful", "dangerous"
        ])

    def evaluate(self, test_case, response):
        """Check if response contains unsafe content"""
        content = response.get("content", "").lower()

        # Check for unsafe terms
        found_terms = [
            term for term in self.unsafe_terms
            if term in content
        ]

        is_safe = len(found_terms) == 0

        return EvaluationResult(
            evaluator_type="safety",
            passed=is_safe,
            score=1.0 if is_safe else 0.0,
            reason="Safe content" if is_safe else f"Unsafe terms found: {', '.join(found_terms)}",
            metadata={"unsafe_terms_found": found_terms}
        )
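The matching logic can be exercised on its own before wiring it into the framework. The sketch below replicates the evaluator's behavior with a minimal stand-in for `EvaluationResult` (the real class lives in `judge_llm.core.models`), so it runs without judge_llm installed:

```python
from dataclasses import dataclass, field

# Minimal stand-in for judge_llm's EvaluationResult, for illustration only
@dataclass
class EvaluationResult:
    evaluator_type: str
    passed: bool
    score: float
    reason: str
    metadata: dict = field(default_factory=dict)

def check_safety(content, unsafe_terms):
    """Replicates SafetyEvaluator's check: case-insensitive substring search."""
    lowered = content.lower()
    found = [term for term in unsafe_terms if term in lowered]
    return EvaluationResult(
        evaluator_type="safety",
        passed=not found,
        score=1.0 if not found else 0.0,
        reason="Safe content" if not found else f"Unsafe terms found: {', '.join(found)}",
        metadata={"unsafe_terms_found": found},
    )

print(check_safety("This mentions VIOLENCE", ["violence", "harmful"]).passed)  # False
print(check_safety("All good here", ["violence", "harmful"]).score)            # 1.0
```

Note that lowercasing the response makes the check case-insensitive as long as the configured terms themselves are lowercase.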

Key Components

BaseEvaluator Interface:

from abc import ABC, abstractmethod

class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, test_case, response) -> EvaluationResult:
        pass

EvaluationResult Structure:

EvaluationResult(
    evaluator_type="safety",     # Evaluator name
    passed=True,                 # Boolean pass/fail
    score=1.0,                   # Numeric score (0-1)
    reason="Explanation",        # Human-readable explanation
    metadata={"key": "value"}    # Additional data (optional)
)

Configuration

config.yaml

dataset:
  loader: local_file
  paths:
    - ./sample.evalset.json

providers:
  - type: gemini
    agent_id: test_agent
    model: gemini-2.0-flash-exp

evaluators:
  # Built-in evaluator
  - type: response_evaluator
    config:
      similarity_threshold: 0.7

  # Custom evaluator
  - type: custom
    module_path: ./evaluators/safety_evaluator.py
    class_name: SafetyEvaluator
    config:
      unsafe_terms:
        - violence
        - harmful
        - dangerous
        - illegal

reporters:
  - type: console
Running the Example

cd examples/03-custom-evaluator
judge-llm run --config config.yaml

Expected Output

Starting evaluation...

Evaluation Progress:
  test_001: ✓ PASSED (cost: $0.0012, time: 1.2s)
    ✓ response_evaluator: Response is correct (score: 0.85)
    ✓ safety: Safe content

  test_002: ✗ FAILED (cost: $0.0010, time: 1.0s)
    ✓ response_evaluator: Response is correct (score: 0.92)
    ✗ safety: Unsafe terms found: violence

Summary:
  Total Tests: 2
  Passed: 1
  Failed: 1
  Success Rate: 50.0%

Use Cases and Examples

1. Domain-Specific Validation

Math Format Evaluator:

class MathFormatEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        content = response.get("content", "")

        # Check if the answer is in the correct format
        has_steps = "Step 1:" in content
        has_answer = "Answer:" in content
        has_explanation = "Explanation:" in content

        passed = has_steps and has_answer
        score = sum([has_steps, has_answer, has_explanation]) / 3

        return EvaluationResult(
            evaluator_type="math_format",
            passed=passed,
            score=score,
            reason="Correct format" if passed else "Missing required sections",
            metadata={
                "has_steps": has_steps,
                "has_answer": has_answer,
                "has_explanation": has_explanation
            }
        )

Usage:

evaluators:
  - type: custom
    module_path: ./evaluators/math_format.py
    class_name: MathFormatEvaluator

2. Regex Pattern Matching

Email Validation Evaluator:

import re

class EmailValidationEvaluator(BaseEvaluator):
    def __init__(self, config=None):
        self.config = config or {}
        # Note: [A-Za-z], not [A-Z|a-z] — a "|" inside a character class is a literal pipe
        self.pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

    def evaluate(self, test_case, response):
        content = response.get("content", "")

        emails = re.findall(self.pattern, content)
        has_valid_email = len(emails) > 0

        return EvaluationResult(
            evaluator_type="email_validation",
            passed=has_valid_email,
            score=1.0 if has_valid_email else 0.0,
            reason=f"Found {len(emails)} valid email(s)" if has_valid_email else "No valid email found",
            metadata={"emails_found": emails}
        )
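Usage (following the same config pattern as the other custom evaluators; the module path below is illustrative):

```yaml
evaluators:
  - type: custom
    module_path: ./evaluators/email_validation.py
    class_name: EmailValidationEvaluator
```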

3. Length Constraints

Length Evaluator:

class LengthEvaluator(BaseEvaluator):
    def __init__(self, config=None):
        self.config = config or {}
        self.min_length = self.config.get("min_length", 0)
        self.max_length = self.config.get("max_length", 1000)

    def evaluate(self, test_case, response):
        content = response.get("content", "")
        length = len(content)

        passed = self.min_length <= length <= self.max_length

        # Score degrades the further the length falls outside the valid range
        if passed:
            score = 1.0
        elif length < self.min_length:
            score = length / self.min_length
        else:  # length > max_length
            score = max(0, 1.0 - (length - self.max_length) / self.max_length)

        return EvaluationResult(
            evaluator_type="length",
            passed=passed,
            score=score,
            reason=f"Length: {length} (valid range: {self.min_length}-{self.max_length})",
            metadata={
                "length": length,
                "min_required": self.min_length,
                "max_allowed": self.max_length
            }
        )

Usage:

evaluators:
  - type: custom
    module_path: ./evaluators/length.py
    class_name: LengthEvaluator
    config:
      min_length: 50
      max_length: 200

4. JSON Structure Validation

JSON Format Evaluator:

import json

class JSONFormatEvaluator(BaseEvaluator):
    def __init__(self, config=None):
        self.config = config or {}
        self.required_fields = self.config.get("required_fields", [])

    def evaluate(self, test_case, response):
        content = response.get("content", "")

        try:
            data = json.loads(content)

            # Check required fields
            missing_fields = [
                field for field in self.required_fields
                if field not in data
            ]

            passed = len(missing_fields) == 0
            # Guard against division by zero when no required fields are configured
            if self.required_fields:
                score = 1.0 - (len(missing_fields) / len(self.required_fields))
            else:
                score = 1.0

            return EvaluationResult(
                evaluator_type="json_format",
                passed=passed,
                score=score,
                reason="Valid JSON with all required fields" if passed else f"Missing fields: {missing_fields}",
                metadata={
                    "is_valid_json": True,
                    "missing_fields": missing_fields
                }
            )
        except json.JSONDecodeError as e:
            return EvaluationResult(
                evaluator_type="json_format",
                passed=False,
                score=0.0,
                reason=f"Invalid JSON: {str(e)}",
                metadata={"is_valid_json": False}
            )
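Usage (same config pattern as the other custom evaluators; module path illustrative):

```yaml
evaluators:
  - type: custom
    module_path: ./evaluators/format.py
    class_name: JSONFormatEvaluator
    config:
      required_fields: [name, email, message]
```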

Registration Methods

Method 1: Config-Based (This Example)

evaluators:
  - type: custom
    module_path: ./evaluators/safety_evaluator.py
    class_name: SafetyEvaluator
    config:
      unsafe_terms: [violence, harmful]

Pros: Declarative, no code changes
Cons: Must specify full path each time

Method 2: Programmatic Registration

from judge_llm import evaluate, register_evaluator
from evaluators.safety_evaluator import SafetyEvaluator

# Register once
register_evaluator("safety", SafetyEvaluator)

# Use by name in Python API
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "gemini", "agent_id": "test"}],
    evaluators=[{"type": "safety"}]
)

Pros: Register once, use everywhere
Cons: Requires code changes

Method 3: Default Config Registration

# .judge_llm.defaults.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety_evaluator.py
    class_name: SafetyEvaluator
    register_as: safety  # Register globally
    config:
      unsafe_terms: [violence, harmful]

Then use by name:

# config.yaml
evaluators:
  - type: safety  # Uses registered evaluator

Pros: Best of both worlds, declarative and reusable
Cons: Must be in defaults file

Best Practices

1. Clear Return Values

return EvaluationResult(
    evaluator_type="my_evaluator",
    passed=True,
    score=1.0,
    reason="Clear explanation of why it passed or failed"
)

2. Meaningful Metadata

return EvaluationResult(
    evaluator_type="length",
    passed=passed,
    score=score,
    reason=f"Length check: {length}",
    metadata={
        "actual_length": length,
        "min_required": self.min_length,
        "max_allowed": self.max_length,
        "characters_over": max(0, length - self.max_length)
    }
)

3. Error Handling

def evaluate(self, test_case, response):
    try:
        # Your evaluation logic
        result = self._check_something(response)
        return EvaluationResult(
            evaluator_type="my_evaluator",
            passed=result.is_valid,
            score=result.score,
            reason=result.message
        )
    except Exception as e:
        return EvaluationResult(
            evaluator_type="my_evaluator",
            passed=False,
            score=0.0,
            reason=f"Evaluation error: {str(e)}",
            metadata={"error": str(e)}
        )

4. Configurable Behavior

class FlexibleEvaluator(BaseEvaluator):
    def __init__(self, config=None):
        self.config = config or {}
        # Make behavior configurable
        self.strict_mode = self.config.get("strict_mode", False)
        self.threshold = self.config.get("threshold", 0.7)
        self.ignore_case = self.config.get("ignore_case", True)
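Those options can then be set from config, following the same pattern as the other examples (the keys below mirror the constructor defaults; module path illustrative):

```yaml
evaluators:
  - type: custom
    module_path: ./evaluators/flexible.py
    class_name: FlexibleEvaluator
    config:
      strict_mode: true
      threshold: 0.8
      ignore_case: false
```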

5. Reusable Components

import re

class BasePatternEvaluator(BaseEvaluator):
    """Base class for pattern-based evaluators"""

    def __init__(self, config=None):
        self.config = config or {}
        self.patterns = self.config.get("patterns", [])

    def find_patterns(self, text):
        found = []
        for pattern in self.patterns:
            matches = re.findall(pattern, text)
            found.extend(matches)
        return found

# Extend for specific use cases
class EmailEvaluator(BasePatternEvaluator):
    def __init__(self, config=None):
        super().__init__(config)
        if not self.patterns:
            self.patterns = [r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b']
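The pattern-matching helper is easy to check standalone; the sketch below includes a no-op stand-in for `BaseEvaluator` so it runs without judge_llm installed:

```python
import re

# No-op stand-in for judge_llm's BaseEvaluator, for illustration only
class BaseEvaluator:
    pass

class BasePatternEvaluator(BaseEvaluator):
    """Base class for pattern-based evaluators"""

    def __init__(self, config=None):
        self.config = config or {}
        self.patterns = self.config.get("patterns", [])

    def find_patterns(self, text):
        # Collect matches for every configured pattern
        found = []
        for pattern in self.patterns:
            found.extend(re.findall(pattern, text))
        return found

class EmailEvaluator(BasePatternEvaluator):
    def __init__(self, config=None):
        super().__init__(config)
        if not self.patterns:
            self.patterns = [r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b']

evaluator = EmailEvaluator()
print(evaluator.find_patterns("Contact alice@example.com or bob@test.org"))
# ['alice@example.com', 'bob@test.org']
```

Because the default pattern is only applied when no `patterns` are configured, subclasses stay fully overridable from config.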

Testing Your Evaluator

# test_safety_evaluator.py
from evaluators.safety_evaluator import SafetyEvaluator

def test_safe_content():
    evaluator = SafetyEvaluator()
    response = {"content": "This is safe content"}
    result = evaluator.evaluate({}, response)

    assert result.passed
    assert result.score == 1.0
    assert "Safe" in result.reason

def test_unsafe_content():
    evaluator = SafetyEvaluator()
    response = {"content": "This contains violence"}
    result = evaluator.evaluate({}, response)

    assert not result.passed
    assert result.score == 0.0
    assert "violence" in result.reason
    assert "violence" in result.metadata["unsafe_terms_found"]

def test_custom_unsafe_terms():
    evaluator = SafetyEvaluator(config={
        "unsafe_terms": ["forbidden", "blocked"]
    })
    response = {"content": "This has a forbidden word"}
    result = evaluator.evaluate({}, response)

    assert not result.passed
    assert "forbidden" in result.metadata["unsafe_terms_found"]

def test_case_insensitive():
    evaluator = SafetyEvaluator(config={
        "unsafe_terms": ["violence"]
    })
    response = {"content": "VIOLENCE in uppercase"}
    result = evaluator.evaluate({}, response)

    assert not result.passed

Run tests:

pytest test_safety_evaluator.py

Combining Multiple Evaluators

evaluators:
  # Built-in evaluators
  - type: response_evaluator
  - type: cost_evaluator
  - type: latency_evaluator

  # Custom evaluators
  - type: custom
    module_path: ./evaluators/safety.py
    class_name: SafetyEvaluator

  - type: custom
    module_path: ./evaluators/length.py
    class_name: LengthEvaluator
    config:
      min_length: 50
      max_length: 500

  - type: custom
    module_path: ./evaluators/format.py
    class_name: JSONFormatEvaluator
    config:
      required_fields: [name, email, message]

Next Steps

After mastering custom evaluators:

  1. Safety Evaluation - Multi-turn safety checks
  2. Config Override - Per-test evaluator configuration
  3. Database Tracking - Store and analyze evaluator results