Application Overview
A comprehensive guide to understanding Judge LLM's architecture, components, and design principles.
What is Judge LLM?
Judge LLM is a lightweight, extensible Python framework designed to systematically evaluate and compare Large Language Model (LLM) providers. It provides a structured approach to testing AI agents, measuring performance, tracking costs, and ensuring quality before production deployment.
Core Purpose
- Systematic Testing: Run repeatable, version-controlled test suites against your AI agents
- Provider Comparison: A/B test different LLM providers (Gemini, OpenAI, Anthropic, etc.)
- Quality Assurance: Validate response quality, latency, costs, and safety before deployment
- Regression Prevention: Catch performance degradations when models or code change
- Cost Optimization: Track and optimize API costs across different providers and models
Architecture Overview
Judge LLM follows a modular, registry-based architecture with clear separation of concerns:
graph TB
subgraph "Judge LLM Framework"
subgraph "Component Layer"
P[Providers<br/>Gemini, OpenAI, Mock, Custom]
E[Evaluators<br/>Response, Trajectory, Cost, Latency]
R[Reporters<br/>Console, HTML, JSON, Database]
end
subgraph "Core Layer"
REG[Registry Core<br/>Component Registration<br/>Lifecycle Management]
CONFIG[Config Loader<br/>YAML/JSON Parse<br/>Env Variables<br/>Validation]
EVAL_ENG[Evaluator Engine<br/>Execute & Collect]
REP_ENG[Reporter Engine<br/>Format & Output]
end
subgraph "Data Layer"
DS[Dataset Loader<br/>JSON/YAML Support<br/>Local File/Directory]
DB[(SQLite Database<br/>Historical Results)]
end
P --> REG
E --> REG
R --> REG
CONFIG --> REG
DS --> REG
REG --> EVAL_ENG
REG --> REP_ENG
EVAL_ENG --> REP_ENG
REP_ENG --> DB
end
USER[User/CLI] --> CONFIG
CONFIG --> DS
style P fill:#e1f5ff
style E fill:#e1f5ff
style R fill:#e1f5ff
style REG fill:#fff4e1
style CONFIG fill:#fff4e1
style EVAL_ENG fill:#fff4e1
style REP_ENG fill:#fff4e1
style DS fill:#f0f0f0
style DB fill:#f0f0f0
Key Components
1. Providers
Purpose: Abstract away different LLM APIs into a unified interface.
Built-in Providers:
- Gemini - Google's Gemini API (Flash, Pro models)
- Mock - Testing provider that returns expected responses without API calls
- Custom - Extend for OpenAI, Anthropic, Azure, or any LLM API
Key Responsibilities:
- Send prompts to LLM APIs
- Handle authentication and rate limiting
- Track token usage and costs
- Return standardized response format
- Support conversation history/context
Example:
from judge_llm.providers.base import BaseProvider

class MyProvider(BaseProvider):
    def invoke(self, messages, config):
        # Call your LLM API
        response = my_api.generate(messages)
        return {
            "content": response.text,
            "cost": response.usage.cost,
            "tokens": response.usage.tokens,
        }
2. Evaluators
Purpose: Assess the quality and characteristics of LLM responses.
Built-in Evaluators:
- Response Evaluator - Semantic similarity, exact matching, ROUGE scores
- Trajectory Evaluator - Validates conversation flow and tool usage
- Cost Evaluator - Monitors API costs against budgets
- Latency Evaluator - Tracks response times and timeouts
- Custom Evaluators - Implement domain-specific validation
Key Responsibilities:
- Compare actual vs expected responses
- Calculate similarity/quality scores
- Validate conversation trajectories
- Monitor performance metrics
- Support per-test configuration overrides
Example:
# EvaluationResult is assumed to live alongside BaseEvaluator
from judge_llm.evaluators.base import BaseEvaluator, EvaluationResult

class SafetyEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        is_safe = self.check_safety(response["content"])
        return EvaluationResult(
            evaluator_type="safety",
            passed=is_safe,
            score=1.0 if is_safe else 0.0,
            reason="Safe content" if is_safe else "Unsafe content detected",
        )
3. Reporters
Purpose: Format and output evaluation results for different use cases.
Built-in Reporters:
- Console - Real-time terminal output with colored formatting
- HTML - Interactive dashboard with charts and tables
- JSON - Structured data for programmatic access
- Database - SQLite storage for historical tracking and trends
- Custom Reporters - CSV, Slack notifications, custom dashboards
Key Responsibilities:
- Format evaluation results
- Generate visualizations
- Store historical data
- Enable trend analysis
- Support multiple output formats simultaneously
Example:
from judge_llm.reporters.base import BaseReporter

class SlackReporter(BaseReporter):
    def report(self, evaluation_results):
        message = self.format_slack_message(evaluation_results)
        self.slack_client.post_message(message)
4. Registry System
Purpose: Central component registration and lifecycle management.
Features:
- Component Registration - Register providers, evaluators, reporters by name
- Lazy Loading - Components instantiated only when needed
- Configuration Binding - Automatically inject configuration into components
- Lifecycle Management - Handle setup, execution, and cleanup
- Type Safety - Validate component types at registration
Example:
from judge_llm.core.registry import Registry

# Register custom components
Registry.register_provider("my_provider", MyProvider)
Registry.register_evaluator("safety", SafetyEvaluator)

# Use by name in configuration
providers:
  - type: my_provider
evaluators:
  - type: safety
5. Configuration System
Purpose: Flexible, hierarchical configuration management.
Configuration Sources (in precedence order):
- Test Config (config.yaml) - Specific test settings
- Project Defaults (.judge_llm.defaults.yaml) - Project-wide defaults
- Global Defaults (~/.judge_llm/defaults.yaml) - User defaults
- Built-in Defaults - Framework defaults
Key Features:
- Deep Merging - Intelligently combine configurations
- Environment Variables - ${VAR_NAME:-default} syntax
- Validation - Schema validation before execution
- Per-Test Overrides - Override evaluator settings per test case
Example:
# .judge_llm.defaults.yaml (project defaults)
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
    temperature: 0.7
evaluators:
  - type: response_evaluator
  - type: cost_evaluator

# config.yaml (test-specific)
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - agent_id: my_agent  # Inherits type, model, temperature from defaults
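The deep merge that combines defaults with test-specific settings can be sketched as a recursive dictionary merge (an illustration, not the framework's actual merge code; merging of provider lists is elided here):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: descend and merge key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and mismatched types: the override replaces the default
            merged[key] = value
    return merged

defaults = {"provider": {"type": "gemini", "temperature": 0.7}}
test_config = {"provider": {"agent_id": "my_agent"}}
print(deep_merge(defaults, test_config))
# The provider keeps type and temperature from defaults and gains agent_id
```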
Data Flow
Evaluation Execution Flow
sequenceDiagram
participant User
participant CLI
participant ConfigLoader
participant Registry
participant DataLoader
participant Provider
participant Evaluator
participant Reporter
User->>CLI: judge-llm run --config config.yaml
CLI->>ConfigLoader: Load configuration
ConfigLoader->>ConfigLoader: Load config.yaml
ConfigLoader->>ConfigLoader: Load .judge_llm.defaults.yaml
ConfigLoader->>ConfigLoader: Load .env variables
ConfigLoader->>ConfigLoader: Merge & validate configs
ConfigLoader-->>CLI: Configuration ready
CLI->>Registry: Register components
Registry->>Registry: Register providers
Registry->>Registry: Register evaluators
Registry->>Registry: Register reporters
Registry-->>CLI: Components registered
CLI->>DataLoader: Load datasets
DataLoader->>DataLoader: Load JSON/YAML files
DataLoader->>DataLoader: Parse evalsets
DataLoader-->>CLI: Test cases loaded
loop For each test case
CLI->>Provider: Execute test case
Provider->>Provider: Send to LLM API
Provider-->>CLI: Response received
CLI->>Evaluator: Run evaluators
Evaluator->>Evaluator: Response evaluation
Evaluator->>Evaluator: Trajectory evaluation
Evaluator->>Evaluator: Cost evaluation
Evaluator->>Evaluator: Latency evaluation
Evaluator-->>CLI: Evaluation results
end
CLI->>Reporter: Generate reports
Reporter->>Reporter: Console output
Reporter->>Reporter: HTML dashboard
Reporter->>Reporter: JSON export
Reporter->>Reporter: Database storage
Reporter-->>User: Reports generated
CLI->>CLI: Cleanup resources
CLI-->>User: Evaluation complete
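Stripped of configuration loading and error handling, the flow above reduces to a simple orchestration loop. All names here are illustrative, not the framework's internal API:

```python
def run_evaluation(config, providers, evaluators, reporters, test_cases):
    """Minimal orchestration loop mirroring the sequence diagram:
    invoke each provider per test case, evaluate, then report."""
    results = []
    for case in test_cases:
        for provider in providers:
            response = provider.invoke(case["messages"], config)  # LLM call
            for evaluator in evaluators:
                results.append(evaluator.evaluate(case, response))
    for reporter in reporters:  # every reporter sees the full result set
        reporter.report(results)
    return results
```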
Test Case Structure
{
  "eval_set_id": "test_suite_v1",
  "name": "Test Suite Name",
  "description": "Suite description",
  "eval_cases": [
    {
      "eval_id": "test_001",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": {
            "parts": [{"text": "User prompt"}],
            "role": "user"
          },
          "final_response": {
            "parts": [{"text": "Expected response"}]
          }
        }
      ],
      "session_input": {
        "user_prompt": "User prompt",
        "system_instruction": "System prompt"
      },
      "evaluator_config": {
        "ResponseEvaluator": {
          "similarity_threshold": 0.85
        }
      }
    }
  ]
}
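A loader for files in this format might look like the following sketch. The field names are taken from the JSON above; the real Dataset Loader may differ:

```python
import json

def load_eval_cases(path: str) -> list[dict]:
    """Load an eval set file and return its test cases."""
    with open(path, encoding="utf-8") as f:
        eval_set = json.load(f)
    cases = eval_set.get("eval_cases", [])
    for case in cases:
        # Attach the suite id so results can be grouped by suite later
        case["eval_set_id"] = eval_set["eval_set_id"]
    return cases
```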
Design Principles
1. Extensibility First
Every core component (Provider, Evaluator, Reporter) can be extended:
classDiagram
class BaseProvider {
<<abstract>>
+invoke(messages, config)
+get_cost()
+cleanup()
}
class BaseEvaluator {
<<abstract>>
+evaluate(test_case, response)
+get_score()
}
class BaseReporter {
<<abstract>>
+report(results)
+format_output()
}
class GeminiProvider {
+invoke(messages, config)
+get_cost()
}
class CustomProvider {
+invoke(messages, config)
+get_cost()
}
class ResponseEvaluator {
+evaluate(test_case, response)
+calculate_similarity()
}
class CustomEvaluator {
+evaluate(test_case, response)
+custom_logic()
}
class HTMLReporter {
+report(results)
+generate_dashboard()
}
class CustomReporter {
+report(results)
+send_notification()
}
BaseProvider <|-- GeminiProvider
BaseProvider <|-- CustomProvider
BaseEvaluator <|-- ResponseEvaluator
BaseEvaluator <|-- CustomEvaluator
BaseReporter <|-- HTMLReporter
BaseReporter <|-- CustomReporter
note for BaseProvider "Extend to add\nnew LLM providers"
note for BaseEvaluator "Extend to add\ncustom evaluation logic"
note for BaseReporter "Extend to add\ncustom reporting"
Example implementation:
# Extend any base class
from judge_llm.providers.base import BaseProvider
from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.reporters.base import BaseReporter

# Subclass whichever base fits your component, e.g. a provider:
class MyProvider(BaseProvider):
    def __init__(self, config):
        # Your initialization
        pass

    def invoke(self, messages, config):
        # Your implementation
        pass
2. Configuration Over Code
Prefer declarative YAML configuration over imperative code:
# config.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.8
3. Convention Over Configuration
Sensible defaults minimize required configuration:
# Minimal config - uses built-in defaults
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - type: gemini
    agent_id: my_agent
4. Composability
Mix and match components freely:
providers:
  - type: gemini
  - type: openai
  - type: custom
    module_path: ./my_provider.py
evaluators:
  - type: response_evaluator
  - type: cost_evaluator
  - type: custom
    module_path: ./my_evaluator.py
reporters:
  - type: console
  - type: html
  - type: database
5. Testability
The framework is designed for easy testing:
- Mock Provider - Test without API calls
- Isolated Components - Unit test each component
- Dependency Injection - Easy to mock dependencies
- Deterministic - Consistent results with same inputs
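In practice, these properties mean a component can be unit-tested in a few lines with no network access. For example, a hand-rolled deterministic provider in the spirit of the built-in Mock provider (illustrative, not the framework's class):

```python
class FakeProvider:
    """Deterministic stand-in for an LLM provider: no API calls, zero cost."""

    def __init__(self, canned_responses: dict[str, str]):
        self.canned = canned_responses
        self.calls = 0

    def invoke(self, messages, config):
        self.calls += 1
        prompt = messages[-1]["content"]
        return {"content": self.canned.get(prompt, ""), "cost": 0.0, "tokens": 0}

# Unit test: same input, same output, no network
fake = FakeProvider({"2+2?": "4"})
first = fake.invoke([{"role": "user", "content": "2+2?"}], {})
second = fake.invoke([{"role": "user", "content": "2+2?"}], {})
assert first == second == {"content": "4", "cost": 0.0, "tokens": 0}
```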
Use Cases
1. Regression Testing
Scenario: Ensure new model versions don't degrade quality
# tests/regression_suite.yaml
dataset:
  loader: local_file
  paths: [./regression_tests.json]
providers:
  - type: gemini
    agent_id: production_agent
    model: ${MODEL_VERSION}
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85
reporters:
  - type: database
    db_path: ./results.db
Run before/after model updates:
MODEL_VERSION=gemini-1.5-flash judge-llm run --config tests/regression_suite.yaml
MODEL_VERSION=gemini-2.0-flash judge-llm run --config tests/regression_suite.yaml
2. A/B Testing Providers
Scenario: Compare Gemini vs OpenAI vs Anthropic
providers:
  - type: gemini
    agent_id: test_agent
    model: gemini-2.0-flash-exp
  - type: openai
    agent_id: test_agent
    model: gpt-4
  - type: anthropic
    agent_id: test_agent
    model: claude-3-sonnet
reporters:
  - type: html
    output_path: ./comparison.html
The framework automatically runs every test case against each configured provider and compares the results.
3. Cost Optimization
Scenario: Find the cheapest model meeting quality requirements
providers:
  - type: gemini
    model: gemini-1.5-flash  # Cheapest
  - type: gemini
    model: gemini-1.5-pro    # More capable
  - type: gemini
    model: gemini-2.0-flash  # Latest
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.8  # Minimum quality
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05
reporters:
  - type: database
Analyze the results to find the best price/performance trade-off.
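Once results are in SQLite, a price/performance query might look like this sketch. The table and column names here are assumptions for illustration — check them against your actual results schema:

```python
import sqlite3

def cheapest_passing_model(db_path: str, min_score: float = 0.8):
    """Return (model, avg_score, avg_cost) rows that meet the quality bar,
    cheapest first. Table/column names are illustrative assumptions."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT model, AVG(score) AS avg_score, AVG(cost) AS avg_cost
        FROM execution_runs
        GROUP BY model
        HAVING avg_score >= ?
        ORDER BY avg_cost ASC
        """,
        (min_score,),
    ).fetchall()
    conn.close()
    return rows
```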
4. CI/CD Integration
Scenario: Automated testing in deployment pipeline
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Judge LLM
        run: pip install judge-llm
      - name: Run Tests
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: judge-llm run --config tests/ci_suite.yaml
GitHub Actions fails the job automatically when any step exits with a non-zero status, so no separate result-check step is needed (a follow-up step's $? would always reflect that step's own commands, not the test run).
5. Safety Validation
Scenario: Validate responses don't contain harmful content
# evaluators/safety_evaluator.py
# EvaluationResult is assumed to live alongside BaseEvaluator
from judge_llm.evaluators.base import BaseEvaluator, EvaluationResult

class SafetyEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        # Check for PII, toxicity, harmful instructions
        issues = []
        content = response["content"]
        if self.contains_pii(content):
            issues.append("PII detected")
        if self.is_toxic(content):
            issues.append("Toxic content")
        return EvaluationResult(
            passed=len(issues) == 0,
            reason="Safe" if not issues else f"Issues: {', '.join(issues)}",
        )

# config.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety_evaluator.py
    class_name: SafetyEvaluator
Performance Considerations
Parallel Execution
Run multiple test cases concurrently:
agent:
  parallel_execution: true
  max_workers: 5
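Under the hood, a setting like max_workers typically maps onto a worker pool. A sketch of what this implies (illustrative, not the framework's executor) — a thread pool suits API calls because they are I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(test_cases, execute_one, max_workers: int = 5):
    """Run test cases concurrently with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order even if cases finish out of order
        return list(pool.map(execute_one, test_cases))

# Example with a trivial stand-in for a provider call
results = run_parallel([1, 2, 3], lambda case: case * 2, max_workers=2)
print(results)  # [2, 4, 6]
```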
Caching
Mock provider caches responses for development:
providers:
  - type: mock
    cache_responses: true
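Response caching of this kind can be approximated with a small keyed cache around any provider (a sketch, not the Mock provider's actual code):

```python
import hashlib
import json

class CachingProvider:
    """Wraps a provider and memoizes responses by message content."""

    def __init__(self, inner):
        self.inner = inner
        self._cache: dict[str, dict] = {}

    def invoke(self, messages, config):
        # Stable key: hash of the canonical JSON form of the messages
        key = hashlib.sha256(
            json.dumps(messages, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.inner.invoke(messages, config)
        return self._cache[key]
```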
Database Optimization
Index frequently queried fields:
CREATE INDEX idx_eval_case_id ON execution_runs(eval_case_id);
CREATE INDEX idx_generated_at ON reports(generated_at);
Best Practices
1. Start with Mock Provider
Develop test cases without API costs:
providers:
- type: mock
2. Use Default Configurations
Share common settings across tests:
# .judge_llm.defaults.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
evaluators:
  - type: response_evaluator
  - type: cost_evaluator
3. Version Control Everything
project/
├── .judge_llm.defaults.yaml # Project defaults
├── tests/
│ ├── regression_suite.yaml # Test configs
│ ├── regression_tests.json # Test cases
│ └── safety_tests.json
├── evaluators/ # Custom evaluators
└── .env.example # Environment template
4. Monitor Costs
Use database reporter to track spending:
sqlite3 results.db "
SELECT
DATE(generated_at) as date,
SUM(total_cost) as daily_cost
FROM reports
GROUP BY DATE(generated_at)
ORDER BY date DESC
"
5. Incremental Testing
Build up test suites gradually:
- Start with basic happy path tests
- Add edge cases
- Add error scenarios
- Add performance benchmarks
- Add safety validations
Next Steps
- Quick Start - Get started in 5 minutes
- Configuration Guide - Deep dive into configuration
- Examples - Learn by example
- Custom Components - Extend the framework
- Python API - Programmatic usage