Application Overview

A comprehensive guide to understanding Judge LLM's architecture, components, and design principles.

What is Judge LLM?

Judge LLM is a lightweight, extensible Python framework designed to systematically evaluate and compare Large Language Model (LLM) providers. It provides a structured approach to testing AI agents, measuring performance, tracking costs, and ensuring quality before production deployment.

Core Purpose

  • Systematic Testing: Run repeatable, version-controlled test suites against your AI agents
  • Provider Comparison: A/B test different LLM providers (Gemini, OpenAI, Anthropic, etc.)
  • Quality Assurance: Validate response quality, latency, costs, and safety before deployment
  • Regression Prevention: Catch performance degradations when models or code change
  • Cost Optimization: Track and optimize API costs across different providers and models

Architecture Overview

Judge LLM follows a modular, registry-based architecture with clear separation of concerns:

graph TB
    subgraph "Judge LLM Framework"
        subgraph "Component Layer"
            P[Providers<br/>Gemini, OpenAI, Mock, Custom]
            E[Evaluators<br/>Response, Trajectory, Cost, Latency]
            R[Reporters<br/>Console, HTML, JSON, Database]
        end

        subgraph "Core Layer"
            REG[Registry Core<br/>Component Registration<br/>Lifecycle Management]
            CONFIG[Config Loader<br/>YAML/JSON Parse<br/>Env Variables<br/>Validation]
            EVAL_ENG[Evaluator Engine<br/>Execute & Collect]
            REP_ENG[Reporter Engine<br/>Format & Output]
        end

        subgraph "Data Layer"
            DS[Dataset Loader<br/>JSON/YAML Support<br/>Local File/Directory]
            DB[(SQLite Database<br/>Historical Results)]
        end

        P --> REG
        E --> REG
        R --> REG

        CONFIG --> REG
        DS --> REG

        REG --> EVAL_ENG
        REG --> REP_ENG

        EVAL_ENG --> REP_ENG
        REP_ENG --> DB
    end

    USER[User/CLI] --> CONFIG
    CONFIG --> DS

    style P fill:#e1f5ff
    style E fill:#e1f5ff
    style R fill:#e1f5ff
    style REG fill:#fff4e1
    style CONFIG fill:#fff4e1
    style EVAL_ENG fill:#fff4e1
    style REP_ENG fill:#fff4e1
    style DS fill:#f0f0f0
    style DB fill:#f0f0f0

Key Components

1. Providers

Purpose: Abstract away different LLM APIs into a unified interface.

Built-in Providers:

  • Gemini - Google's Gemini API (Flash, Pro models)
  • Mock - Testing provider that returns expected responses without API calls
  • Custom - Extend for OpenAI, Anthropic, Azure, or any LLM API

Key Responsibilities:

  • Send prompts to LLM APIs
  • Handle authentication and rate limiting
  • Track token usage and costs
  • Return standardized response format
  • Support conversation history/context

Example:

from judge_llm.providers.base import BaseProvider

class MyProvider(BaseProvider):
    def invoke(self, messages, config):
        # Call your LLM API
        response = my_api.generate(messages)
        return {
            "content": response.text,
            "cost": response.usage.cost,
            "tokens": response.usage.tokens
        }

2. Evaluators

Purpose: Assess the quality and characteristics of LLM responses.

Built-in Evaluators:

  • Response Evaluator - Semantic similarity, exact matching, ROUGE scores
  • Trajectory Evaluator - Validates conversation flow and tool usage
  • Cost Evaluator - Monitors API costs against budgets
  • Latency Evaluator - Tracks response times and timeouts
  • Custom Evaluators - Implement domain-specific validation

Key Responsibilities:

  • Compare actual vs expected responses
  • Calculate similarity/quality scores
  • Validate conversation trajectories
  • Monitor performance metrics
  • Support per-test configuration overrides

Example:

from judge_llm.evaluators.base import BaseEvaluator

class SafetyEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        is_safe = self.check_safety(response["content"])
        return EvaluationResult(
            evaluator_type="safety",
            passed=is_safe,
            score=1.0 if is_safe else 0.0,
            reason="Safe content" if is_safe else "Unsafe content detected"
        )

3. Reporters

Purpose: Format and output evaluation results for different use cases.

Built-in Reporters:

  • Console - Real-time terminal output with colored formatting
  • HTML - Interactive dashboard with charts and tables
  • JSON - Structured data for programmatic access
  • Database - SQLite storage for historical tracking and trends
  • Custom Reporters - CSV, Slack notifications, custom dashboards

Key Responsibilities:

  • Format evaluation results
  • Generate visualizations
  • Store historical data
  • Enable trend analysis
  • Support multiple output formats simultaneously

Example:

from judge_llm.reporters.base import BaseReporter

class SlackReporter(BaseReporter):
    def report(self, evaluation_results):
        message = self.format_slack_message(evaluation_results)
        self.slack_client.post_message(message)

4. Registry System

Purpose: Central component registration and lifecycle management.

Features:

  • Component Registration - Register providers, evaluators, reporters by name
  • Lazy Loading - Components instantiated only when needed
  • Configuration Binding - Automatically inject configuration into components
  • Lifecycle Management - Handle setup, execution, and cleanup
  • Type Safety - Validate component types at registration

Example:

from judge_llm.core.registry import Registry

# Register custom components
Registry.register_provider("my_provider", MyProvider)
Registry.register_evaluator("safety", SafetyEvaluator)

# Use by name in configuration
providers:
  - type: my_provider
evaluators:
  - type: safety

5. Configuration System

Purpose: Flexible, hierarchical configuration management.

Configuration Sources (in precedence order):

  1. Test Config (config.yaml) - Specific test settings
  2. Project Defaults (.judge_llm.defaults.yaml) - Project-wide defaults
  3. Global Defaults (~/.judge_llm/defaults.yaml) - User defaults
  4. Built-in Defaults - Framework defaults

Key Features:

  • Deep Merging - Intelligently combine configurations
  • Environment Variables - ${VAR_NAME:-default} syntax
  • Validation - Schema validation before execution
  • Per-Test Overrides - Override evaluator settings per test case

Example:

# .judge_llm.defaults.yaml (project defaults)
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
    temperature: 0.7

evaluators:
  - type: response_evaluator
  - type: cost_evaluator

# config.yaml (test-specific)
dataset:
  loader: local_file
  paths: [./tests.json]

providers:
  - agent_id: my_agent  # Inherits type, model, temperature from defaults
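The deep-merge and `${VAR_NAME:-default}` behavior described above can be sketched in plain Python. This is a simplified illustration of the idea, not the framework's actual loader code:

```python
import os
import re

def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides onto defaults: nested dicts merge,
    scalars and lists in overrides replace the default value."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Matches ${VAR_NAME} and ${VAR_NAME:-default}
_ENV_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(value: str) -> str:
    """Expand environment variable references, falling back to the default."""
    return _ENV_PATTERN.sub(
        lambda m: os.environ.get(m.group(1), m.group(2) or ""), value
    )

defaults = {"providers": {"type": "gemini", "temperature": 0.7}}
test_config = {"providers": {"temperature": 0.2}}
merged = deep_merge(defaults, test_config)
print(merged["providers"])  # type inherited, temperature overridden
print(expand_env("${MODEL_VERSION:-gemini-2.0-flash}"))
```

Here the test config overrides only `temperature`; `type` survives from the defaults, which is exactly how `agent_id: my_agent` above inherits its provider settings.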

Data Flow

Evaluation Execution Flow

sequenceDiagram
    participant User
    participant CLI
    participant ConfigLoader
    participant Registry
    participant DataLoader
    participant Provider
    participant Evaluator
    participant Reporter

    User->>CLI: judge-llm run --config config.yaml
    CLI->>ConfigLoader: Load configuration

    ConfigLoader->>ConfigLoader: Load config.yaml
    ConfigLoader->>ConfigLoader: Load .judge_llm.defaults.yaml
    ConfigLoader->>ConfigLoader: Load .env variables
    ConfigLoader->>ConfigLoader: Merge & validate configs
    ConfigLoader-->>CLI: Configuration ready

    CLI->>Registry: Register components
    Registry->>Registry: Register providers
    Registry->>Registry: Register evaluators
    Registry->>Registry: Register reporters
    Registry-->>CLI: Components registered

    CLI->>DataLoader: Load datasets
    DataLoader->>DataLoader: Load JSON/YAML files
    DataLoader->>DataLoader: Parse evalsets
    DataLoader-->>CLI: Test cases loaded

    loop For each test case
        CLI->>Provider: Execute test case
        Provider->>Provider: Send to LLM API
        Provider-->>CLI: Response received

        CLI->>Evaluator: Run evaluators
        Evaluator->>Evaluator: Response evaluation
        Evaluator->>Evaluator: Trajectory evaluation
        Evaluator->>Evaluator: Cost evaluation
        Evaluator->>Evaluator: Latency evaluation
        Evaluator-->>CLI: Evaluation results
    end

    CLI->>Reporter: Generate reports
    Reporter->>Reporter: Console output
    Reporter->>Reporter: HTML dashboard
    Reporter->>Reporter: JSON export
    Reporter->>Reporter: Database storage
    Reporter-->>User: Reports generated

    CLI->>CLI: Cleanup resources
    CLI-->>User: Evaluation complete

Test Case Structure

{
  "eval_set_id": "test_suite_v1",
  "name": "Test Suite Name",
  "description": "Suite description",
  "eval_cases": [
    {
      "eval_id": "test_001",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": {
            "parts": [{"text": "User prompt"}],
            "role": "user"
          },
          "final_response": {
            "parts": [{"text": "Expected response"}]
          }
        }
      ],
      "session_input": {
        "user_prompt": "User prompt",
        "system_instruction": "System prompt"
      },
      "evaluator_config": {
        "ResponseEvaluator": {
          "similarity_threshold": 0.85
        }
      }
    }
  ]
}
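A minimal sketch of consuming this structure with the standard library follows. Field names come from the example above; the 0.8 fallback threshold is an assumed default, not one documented by the framework:

```python
import json

# Parse an eval set in the shape shown above
eval_set = json.loads("""{
  "eval_set_id": "test_suite_v1",
  "eval_cases": [
    {
      "eval_id": "test_001",
      "session_input": {"user_prompt": "User prompt"},
      "evaluator_config": {"ResponseEvaluator": {"similarity_threshold": 0.85}}
    }
  ]
}""")

for case in eval_set["eval_cases"]:
    # Per-test overrides live in evaluator_config; fall back to an assumed default
    overrides = case.get("evaluator_config", {})
    threshold = overrides.get("ResponseEvaluator", {}).get("similarity_threshold", 0.8)
    print(case["eval_id"], threshold)
```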

Design Principles

1. Extensibility First

Every core component (Provider, Evaluator, Reporter) can be extended:

classDiagram
    class BaseProvider {
        <<abstract>>
        +invoke(messages, config)
        +get_cost()
        +cleanup()
    }

    class BaseEvaluator {
        <<abstract>>
        +evaluate(test_case, response)
        +get_score()
    }

    class BaseReporter {
        <<abstract>>
        +report(results)
        +format_output()
    }

    class GeminiProvider {
        +invoke(messages, config)
        +get_cost()
    }

    class CustomProvider {
        +invoke(messages, config)
        +get_cost()
    }

    class ResponseEvaluator {
        +evaluate(test_case, response)
        +calculate_similarity()
    }

    class CustomEvaluator {
        +evaluate(test_case, response)
        +custom_logic()
    }

    class HTMLReporter {
        +report(results)
        +generate_dashboard()
    }

    class CustomReporter {
        +report(results)
        +send_notification()
    }

    BaseProvider <|-- GeminiProvider
    BaseProvider <|-- CustomProvider
    BaseEvaluator <|-- ResponseEvaluator
    BaseEvaluator <|-- CustomEvaluator
    BaseReporter <|-- HTMLReporter
    BaseReporter <|-- CustomReporter

    note for BaseProvider "Extend to add\nnew LLM providers"
    note for BaseEvaluator "Extend to add\ncustom evaluation logic"
    note for BaseReporter "Extend to add\ncustom reporting"

Example implementation:

# Extend any base class
from judge_llm.providers.base import BaseProvider
from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.reporters.base import BaseReporter

class MyComponent(BaseProvider):  # or BaseEvaluator / BaseReporter
    def __init__(self, config):
        # Your initialization
        pass

    def method(self):
        # Your implementation
        pass

2. Configuration Over Code

Prefer declarative YAML configuration over imperative code:

# config.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.8

3. Convention Over Configuration

Sensible defaults minimize required configuration:

# Minimal config - uses built-in defaults
dataset:
  loader: local_file
  paths: [./tests.json]

providers:
  - type: gemini
    agent_id: my_agent

4. Composability

Mix and match components freely:

providers:
  - type: gemini
  - type: openai
  - type: custom
    module_path: ./my_provider.py

evaluators:
  - type: response_evaluator
  - type: cost_evaluator
  - type: custom
    module_path: ./my_evaluator.py

reporters:
  - type: console
  - type: html
  - type: database

5. Testability

Framework designed for easy testing:

  • Mock Provider - Test without API calls
  • Isolated Components - Unit test each component
  • Dependency Injection - Easy to mock dependencies
  • Deterministic - Consistent results with same inputs
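As a sketch of the unit-testing style this enables, here is a self-contained test of a similarity check with no provider and no API call. `SimilarityEvaluator` is illustrative, not a class shipped by the framework:

```python
from difflib import SequenceMatcher

class SimilarityEvaluator:
    """Hypothetical evaluator: pass if the texts are similar enough."""
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold

    def evaluate(self, expected: str, actual: str) -> bool:
        # SequenceMatcher.ratio() is a simple, deterministic similarity score
        score = SequenceMatcher(None, expected, actual).ratio()
        return score >= self.threshold

def test_identical_text_passes():
    evaluator = SimilarityEvaluator(threshold=0.8)
    assert evaluator.evaluate("Paris is the capital.", "Paris is the capital.")

test_identical_text_passes()
```

Because the inputs are fixed strings and the scoring is deterministic, the test produces the same result on every run.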

Use Cases

1. Regression Testing

Scenario: Ensure new model versions don't degrade quality

# tests/regression_suite.yaml
dataset:
  loader: local_file
  paths: [./regression_tests.json]

providers:
  - type: gemini
    agent_id: production_agent
    model: ${MODEL_VERSION}

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85

reporters:
  - type: database
    db_path: ./results.db

Run before/after model updates:

MODEL_VERSION=gemini-1.5-flash judge-llm run --config tests/regression_suite.yaml
MODEL_VERSION=gemini-2.0-flash judge-llm run --config tests/regression_suite.yaml

2. A/B Testing Providers

Scenario: Compare Gemini vs OpenAI vs Anthropic

providers:
  - type: gemini
    agent_id: test_agent
    model: gemini-2.0-flash-exp

  - type: openai
    agent_id: test_agent
    model: gpt-4

  - type: anthropic
    agent_id: test_agent
    model: claude-3-sonnet

reporters:
  - type: html
    output_path: ./comparison.html

Framework automatically tests all providers and compares results.

3. Cost Optimization

Scenario: Find the cheapest model meeting quality requirements

providers:
  - type: gemini
    model: gemini-1.5-flash   # Cheapest
  - type: gemini
    model: gemini-1.5-pro     # More capable
  - type: gemini
    model: gemini-2.0-flash   # Latest

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.8   # Minimum quality

  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05

reporters:
  - type: database

Analyze results to find optimal price/performance ratio.

4. CI/CD Integration

Scenario: Automated testing in deployment pipeline

# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install Judge LLM
        run: pip install judge-llm
      - name: Run Tests
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: judge-llm run --config tests/ci_suite.yaml

5. Safety Validation

Scenario: Validate responses don't contain harmful content

# evaluators/safety_evaluator.py
from judge_llm.evaluators.base import BaseEvaluator

class SafetyEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        # Check for PII, toxicity, harmful instructions
        issues = []
        content = response["content"]

        if self.contains_pii(content):
            issues.append("PII detected")
        if self.is_toxic(content):
            issues.append("Toxic content")

        return EvaluationResult(
            passed=len(issues) == 0,
            reason="Safe" if not issues else f"Issues: {', '.join(issues)}"
        )

# config.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety_evaluator.py
    class_name: SafetyEvaluator

Performance Considerations

Parallel Execution

Run multiple test cases concurrently:

agent:
  parallel_execution: true
  max_workers: 5
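Conceptually, parallel execution maps test cases across a bounded worker pool. A rough sketch of the idea (not the framework's internal code; `run_case` is a stand-in for provider invocation plus evaluation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_case(case_id: str) -> dict:
    # In the real framework this would call the provider and run evaluators;
    # here it just returns a stub result for the given test case.
    return {"eval_id": case_id, "passed": True}

cases = [f"test_{i:03d}" for i in range(10)]

# max_workers mirrors the config setting above: at most 5 cases in flight
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(run_case, cases))

print(len(results))  # 10
```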

Caching

Mock provider caches responses for development:

providers:
  - type: mock
    cache_responses: true
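A hypothetical sketch of what response caching in a mock provider might look like. This is illustrative only, not the framework's implementation:

```python
class CachingMockProvider:
    """Return canned responses, caching them per prompt when enabled."""
    def __init__(self, cache_responses: bool = True):
        self.cache_responses = cache_responses
        self._cache: dict = {}
        self.calls = 0  # counts cache misses, i.e. "real" generations

    def invoke(self, prompt: str) -> dict:
        if self.cache_responses and prompt in self._cache:
            return self._cache[prompt]
        self.calls += 1
        response = {"content": f"mock reply to: {prompt}", "cost": 0.0}
        if self.cache_responses:
            self._cache[prompt] = response
        return response

provider = CachingMockProvider()
provider.invoke("hello")
provider.invoke("hello")   # served from cache, no new call
print(provider.calls)      # 1
```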

Database Optimization

Index frequently queried fields:

CREATE INDEX idx_eval_case_id ON execution_runs(eval_case_id);
CREATE INDEX idx_generated_at ON reports(generated_at);

Best Practices

1. Start with Mock Provider

Develop test cases without API costs:

providers:
  - type: mock

2. Use Default Configurations

Share common settings across tests:

# .judge_llm.defaults.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
  - type: cost_evaluator

3. Version Control Everything

project/
├── .judge_llm.defaults.yaml     # Project defaults
├── tests/
│   ├── regression_suite.yaml    # Test configs
│   ├── regression_tests.json    # Test cases
│   └── safety_tests.json
├── evaluators/                  # Custom evaluators
└── .env.example                 # Environment template

4. Monitor Costs

Use database reporter to track spending:

sqlite3 results.db "
  SELECT
    DATE(generated_at) AS date,
    SUM(total_cost) AS daily_cost
  FROM reports
  GROUP BY DATE(generated_at)
  ORDER BY date DESC
"

5. Incremental Testing

Build up test suites gradually:

  1. Start with basic happy path tests
  2. Add edge cases
  3. Add error scenarios
  4. Add performance benchmarks
  5. Add safety validations

Next Steps