Application Overview
A comprehensive guide to understanding Judge LLM's architecture, components, and design principles.
What is Judge LLM?
Judge LLM is a lightweight, extensible Python framework designed to systematically evaluate and compare Large Language Model (LLM) providers. It provides a structured approach to testing AI agents, measuring performance, tracking costs, and ensuring quality before production deployment.
Core Purpose
- Systematic Testing: Run repeatable, version-controlled test suites against your AI agents
- Provider Comparison: A/B test different LLM providers (Gemini, OpenAI, Anthropic, etc.)
- Quality Assurance: Validate response quality, latency, costs, and safety before deployment
- Regression Prevention: Catch performance degradations when models or code change
- Cost Optimization: Track and optimize API costs across different providers and models
Architecture Overview
Judge LLM follows a modular, registry-based architecture with clear separation of concerns:
graph TB
subgraph "Judge LLM Framework"
subgraph "Component Layer"
P[Providers<br/>Gemini, OpenAI, Mock, Custom]
E[Evaluators<br/>Response, Trajectory, Cost, Latency]
R[Reporters<br/>Console, HTML, JSON, Database]
end
subgraph "Core Layer"
REG[Registry Core<br/>Component Registration<br/>Lifecycle Management]
CONFIG[Config Loader<br/>YAML/JSON Parse<br/>Env Variables<br/>Validation]
EVAL_ENG[Evaluator Engine<br/>Execute & Collect]
REP_ENG[Reporter Engine<br/>Format & Output]
end
subgraph "Data Layer"
DS[Dataset Loader<br/>JSON/YAML Support<br/>Local File/Directory]
DB[(SQLite Database<br/>Historical Results)]
end
P --> REG
E --> REG
R --> REG
CONFIG --> REG
DS --> REG
REG --> EVAL_ENG
REG --> REP_ENG
EVAL_ENG --> REP_ENG
REP_ENG --> DB
end
USER[User/CLI] --> CONFIG
CONFIG --> DS
style P fill:#e1f5ff
style E fill:#e1f5ff
style R fill:#e1f5ff
style REG fill:#fff4e1
style CONFIG fill:#fff4e1
style EVAL_ENG fill:#fff4e1
style REP_ENG fill:#fff4e1
style DS fill:#f0f0f0
style DB fill:#f0f0f0
Key Components
1. Providers
Purpose: Abstract away different LLM APIs into a unified interface.
Built-in Providers:
- Gemini - Google's Gemini API (Flash, Pro models)
- Mock - Testing provider that returns expected responses without API calls
- Custom - Extend for OpenAI, Anthropic, Azure, or any LLM API
Key Responsibilities:
- Send prompts to LLM APIs
- Handle authentication and rate limiting
- Track token usage and costs
- Return standardized response format
- Support conversation history/context
Example:
from judge_llm.providers.base import BaseProvider

class MyProvider(BaseProvider):
    def invoke(self, messages, config):
        # Call your LLM API
        response = my_api.generate(messages)
        return {
            "content": response.text,
            "cost": response.usage.cost,
            "tokens": response.usage.tokens,
        }
2. Evaluators
Purpose: Assess the quality and characteristics of LLM responses.
Built-in Evaluators:
- Response Evaluator - Semantic similarity, exact matching, ROUGE scores
- Trajectory Evaluator - Validates conversation flow and tool usage
- Cost Evaluator - Monitors API costs against budgets
- Latency Evaluator - Tracks response times and timeouts
- Custom Evaluators - Implement domain-specific validation
Key Responsibilities:
- Compare actual vs expected responses
- Calculate similarity/quality scores
- Validate conversation trajectories
- Monitor performance metrics
- Support per-test configuration overrides
Example:
# EvaluationResult is assumed to live alongside BaseEvaluator
from judge_llm.evaluators.base import BaseEvaluator, EvaluationResult

class SafetyEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        is_safe = self.check_safety(response["content"])
        return EvaluationResult(
            evaluator_type="safety",
            passed=is_safe,
            score=1.0 if is_safe else 0.0,
            reason="Safe content" if is_safe else "Unsafe content detected",
        )
3. Reporters
Purpose: Format and output evaluation results for different use cases.
Built-in Reporters:
- Console - Real-time terminal output with colored formatting
- HTML - Interactive dashboard with charts and tables
- JSON - Structured data for programmatic access
- Database - SQLite storage for historical tracking and trends
- Custom Reporters - CSV, Slack notifications, custom dashboards
Key Responsibilities:
- Format evaluation results
- Generate visualizations
- Store historical data
- Enable trend analysis
- Support multiple output formats simultaneously
Example:
from judge_llm.reporters.base import BaseReporter

class SlackReporter(BaseReporter):
    def report(self, evaluation_results):
        message = self.format_slack_message(evaluation_results)
        self.slack_client.post_message(message)
4. Registry System
Purpose: Central component registration and lifecycle management.
Features:
- Component Registration - Register providers, evaluators, reporters by name
- Lazy Loading - Components instantiated only when needed
- Configuration Binding - Automatically inject configuration into components
- Lifecycle Management - Handle setup, execution, and cleanup
- Type Safety - Validate component types at registration
Example:
from judge_llm.core.registry import Registry

# Register custom components
Registry.register_provider("my_provider", MyProvider)
Registry.register_evaluator("safety", SafetyEvaluator)

# Use by name in configuration
providers:
  - type: my_provider
evaluators:
  - type: safety
5. Configuration System
Purpose: Flexible, hierarchical configuration management.
Configuration Sources (in precedence order):
- Test Config (config.yaml) - Specific test settings
- Project Defaults (.judge_llm.defaults.yaml) - Project-wide defaults
- Global Defaults (~/.judge_llm/defaults.yaml) - User defaults
- Built-in Defaults - Framework defaults
Key Features:
- Deep Merging - Intelligently combine configurations
- Environment Variables - ${VAR_NAME:-default} syntax
- Validation - Schema validation before execution
- Per-Test Overrides - Override evaluator settings per test case
Example:
# .judge_llm.defaults.yaml (project defaults)
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
    temperature: 0.7
evaluators:
  - type: response_evaluator
  - type: cost_evaluator

# config.yaml (test-specific)
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - agent_id: my_agent  # Inherits type, model, temperature from defaults
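The deep merge that combines defaults with test-specific settings can be sketched as a recursive dictionary merge (an illustration, not the framework's actual merge code; merging of provider lists is elided here):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: descend and merge key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and mismatched types: the override replaces the default
            merged[key] = value
    return merged

defaults = {"provider": {"type": "gemini", "temperature": 0.7}}
test_config = {"provider": {"agent_id": "my_agent"}}
print(deep_merge(defaults, test_config))
# The provider keeps type and temperature from defaults and gains agent_id
```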
Data Flow
Evaluation Execution Flow
sequenceDiagram
participant User
participant CLI
participant ConfigLoader
participant Registry
participant DataLoader
participant Provider
participant Evaluator
participant Reporter
User->>CLI: judge-llm run --config config.yaml
CLI->>ConfigLoader: Load configuration
ConfigLoader->>ConfigLoader: Load config.yaml
ConfigLoader->>ConfigLoader: Load .judge_llm.defaults.yaml
ConfigLoader->>ConfigLoader: Load .env variables
ConfigLoader->>ConfigLoader: Merge & validate configs
ConfigLoader-->>CLI: Configuration ready
CLI->>Registry: Register components
Registry->>Registry: Register providers
Registry->>Registry: Register evaluators
Registry->>Registry: Register reporters
Registry-->>CLI: Components registered
CLI->>DataLoader: Load datasets
DataLoader->>DataLoader: Load JSON/YAML files
DataLoader->>DataLoader: Parse evalsets
DataLoader-->>CLI: Test cases loaded
loop For each test case
CLI->>Provider: Execute test case
Provider->>Provider: Send to LLM API
Provider-->>CLI: Response received
CLI->>Evaluator: Run evaluators
Evaluator->>Evaluator: Response evaluation
Evaluator->>Evaluator: Trajectory evaluation
Evaluator->>Evaluator: Cost evaluation
Evaluator->>Evaluator: Latency evaluation
Evaluator-->>CLI: Evaluation results
end
CLI->>Reporter: Generate reports
Reporter->>Reporter: Console output
Reporter->>Reporter: HTML dashboard
Reporter->>Reporter: JSON export
Reporter->>Reporter: Database storage
Reporter-->>User: Reports generated
CLI->>CLI: Cleanup resources
CLI-->>User: Evaluation complete
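Stripped of configuration loading and error handling, the flow above reduces to a simple orchestration loop. All names here are illustrative, not the framework's internal API:

```python
def run_evaluation(config, providers, evaluators, reporters, test_cases):
    """Minimal orchestration loop mirroring the sequence diagram:
    invoke each provider per test case, evaluate, then report."""
    results = []
    for case in test_cases:
        for provider in providers:
            response = provider.invoke(case["messages"], config)  # LLM call
            for evaluator in evaluators:
                results.append(evaluator.evaluate(case, response))
    for reporter in reporters:  # every reporter sees the full result set
        reporter.report(results)
    return results
```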
Test Case Structure
{
  "eval_set_id": "test_suite_v1",
  "name": "Test Suite Name",
  "description": "Suite description",
  "eval_cases": [
    {
      "eval_id": "test_001",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": {
            "parts": [{"text": "User prompt"}],
            "role": "user"
          },
          "final_response": {
            "parts": [{"text": "Expected response"}]
          }
        }
      ],
      "session_input": {
        "user_prompt": "User prompt",
        "system_instruction": "System prompt"
      },
      "evaluator_config": {
        "ResponseEvaluator": {
          "similarity_threshold": 0.85
        }
      }
    }
  ]
}
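A loader for files in this format might look like the following sketch. The field names are taken from the JSON above; the real Dataset Loader may differ:

```python
import json

def load_eval_cases(path: str) -> list[dict]:
    """Load an eval set file and return its test cases."""
    with open(path, encoding="utf-8") as f:
        eval_set = json.load(f)
    cases = eval_set.get("eval_cases", [])
    for case in cases:
        # Attach the suite id so results can be grouped by suite later
        case["eval_set_id"] = eval_set["eval_set_id"]
    return cases
```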
Design Principles
1. Extensibility First
Every core component (Provider, Evaluator, Reporter) can be extended:
classDiagram
class BaseProvider {
<<abstract>>
+invoke(messages, config)
+get_cost()
+cleanup()
}
class BaseEvaluator {
<<abstract>>
+evaluate(test_case, response)
+get_score()
}
class BaseReporter {
<<abstract>>
+report(results)
+format_output()
}
class GeminiProvider {
+invoke(messages, config)
+get_cost()
}
class CustomProvider {
+invoke(messages, config)
+get_cost()
}
class ResponseEvaluator {
+evaluate(test_case, response)
+calculate_similarity()
}
class CustomEvaluator {
+evaluate(test_case, response)
+custom_logic()
}
class HTMLReporter {
+report(results)
+generate_dashboard()
}
class CustomReporter {
+report(results)
+send_notification()
}
BaseProvider <|-- GeminiProvider
BaseProvider <|-- CustomProvider
BaseEvaluator <|-- ResponseEvaluator
BaseEvaluator <|-- CustomEvaluator
BaseReporter <|-- HTMLReporter
BaseReporter <|-- CustomReporter
note for BaseProvider "Extend to add\nnew LLM providers"
note for BaseEvaluator "Extend to add\ncustom evaluation logic"
note for BaseReporter "Extend to add\ncustom reporting"
Example implementation:
# Extend any base class
from judge_llm.providers.base import BaseProvider
from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.reporters.base import BaseReporter

# Subclass whichever base fits your component, e.g. a provider:
class MyProvider(BaseProvider):
    def __init__(self, config):
        # Your initialization
        pass

    def invoke(self, messages, config):
        # Your implementation
        pass
2. Configuration Over Code
Prefer declarative YAML configuration over imperative code:
# config.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.8
3. Convention Over Configuration
Sensible defaults minimize required configuration:
# Minimal config - uses built-in defaults
dataset:
  loader: local_file
  paths: [./tests.json]
providers:
  - type: gemini
    agent_id: my_agent
4. Composability
Mix and match components freely:
providers:
  - type: gemini
  - type: openai
  - type: custom
    module_path: ./my_provider.py
evaluators:
  - type: response_evaluator
  - type: cost_evaluator
  - type: custom
    module_path: ./my_evaluator.py
reporters:
  - type: console
  - type: html
  - type: database
5. Testability
The framework is designed for easy testing:
- Mock Provider - Test without API calls
- Isolated Components - Unit test each component
- Dependency Injection - Easy to mock dependencies
- Deterministic - Consistent results with same inputs
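In practice, these properties mean a component can be unit-tested in a few lines with no network access. For example, a hand-rolled deterministic provider in the spirit of the built-in Mock provider (illustrative, not the framework's class):

```python
class FakeProvider:
    """Deterministic stand-in for an LLM provider: no API calls, zero cost."""

    def __init__(self, canned_responses: dict[str, str]):
        self.canned = canned_responses
        self.calls = 0

    def invoke(self, messages, config):
        self.calls += 1
        prompt = messages[-1]["content"]
        return {"content": self.canned.get(prompt, ""), "cost": 0.0, "tokens": 0}

# Unit test: same input, same output, no network
fake = FakeProvider({"2+2?": "4"})
first = fake.invoke([{"role": "user", "content": "2+2?"}], {})
second = fake.invoke([{"role": "user", "content": "2+2?"}], {})
assert first == second == {"content": "4", "cost": 0.0, "tokens": 0}
```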
Use Cases
1. Regression Testing
Scenario: Ensure new model versions don't degrade quality
# tests/regression_suite.yaml
dataset:
  loader: local_file
  paths: [./regression_tests.json]
providers:
  - type: gemini
    agent_id: production_agent
    model: ${MODEL_VERSION}
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85
reporters:
  - type: database
    db_path: ./results.db
Run before/after model updates:
MODEL_VERSION=gemini-1.5-flash judge-llm run --config tests/regression_suite.yaml
MODEL_VERSION=gemini-2.0-flash judge-llm run --config tests/regression_suite.yaml
2. A/B Testing Providers
Scenario: Compare Gemini vs OpenAI vs Anthropic
providers:
  - type: gemini
    agent_id: test_agent
    model: gemini-2.0-flash-exp
  - type: openai
    agent_id: test_agent
    model: gpt-4
  - type: anthropic
    agent_id: test_agent
    model: claude-3-sonnet
reporters:
  - type: html
    output_path: ./comparison.html
The framework automatically runs every test case against each configured provider and compares the results.
3. Cost Optimization
Scenario: Find the cheapest model meeting quality requirements
providers:
  - type: gemini
    model: gemini-1.5-flash  # Cheapest
  - type: gemini
    model: gemini-1.5-pro    # More capable
  - type: gemini
    model: gemini-2.0-flash  # Latest
evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.8  # Minimum quality
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05
reporters:
  - type: database
Analyze the results to find the best price/performance trade-off.
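Once results are in SQLite, a price/performance query might look like this sketch. The table and column names here are assumptions for illustration — check them against your actual results schema:

```python
import sqlite3

def cheapest_passing_model(db_path: str, min_score: float = 0.8):
    """Return (model, avg_score, avg_cost) rows that meet the quality bar,
    cheapest first. Table/column names are illustrative assumptions."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT model, AVG(score) AS avg_score, AVG(cost) AS avg_cost
        FROM execution_runs
        GROUP BY model
        HAVING avg_score >= ?
        ORDER BY avg_cost ASC
        """,
        (min_score,),
    ).fetchall()
    conn.close()
    return rows
```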
4. CI/CD Integration
Scenario: Automated testing in deployment pipeline
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Judge LLM
        run: pip install judge-llm
      - name: Run Tests
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: judge-llm run --config tests/ci_suite.yaml
GitHub Actions fails the job automatically when any step exits with a non-zero status, so no separate result-check step is needed (a follow-up step's $? would always reflect that step's own commands, not the test run).
5. Safety Validation
Scenario: Validate responses don't contain harmful content
# evaluators/safety_evaluator.py
# EvaluationResult is assumed to live alongside BaseEvaluator
from judge_llm.evaluators.base import BaseEvaluator, EvaluationResult

class SafetyEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        # Check for PII, toxicity, harmful instructions
        issues = []
        content = response["content"]
        if self.contains_pii(content):
            issues.append("PII detected")
        if self.is_toxic(content):
            issues.append("Toxic content")
        return EvaluationResult(
            passed=len(issues) == 0,
            reason="Safe" if not issues else f"Issues: {', '.join(issues)}",
        )

# config.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety_evaluator.py
    class_name: SafetyEvaluator
Performance Considerations
Parallel Execution
Run multiple test cases concurrently:
agent:
  parallel_execution: true
  max_workers: 5
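Under the hood, a setting like max_workers typically maps onto a worker pool. A sketch of what this implies (illustrative, not the framework's executor) — a thread pool suits API calls because they are I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(test_cases, execute_one, max_workers: int = 5):
    """Run test cases concurrently with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order even if cases finish out of order
        return list(pool.map(execute_one, test_cases))

# Example with a trivial stand-in for a provider call
results = run_parallel([1, 2, 3], lambda case: case * 2, max_workers=2)
print(results)  # [2, 4, 6]
```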
Caching
Mock provider caches responses for development:
providers:
  - type: mock
    cache_responses: true
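Response caching of this kind can be approximated with a small keyed cache around any provider (a sketch, not the Mock provider's actual code):

```python
import hashlib
import json

class CachingProvider:
    """Wraps a provider and memoizes responses by message content."""

    def __init__(self, inner):
        self.inner = inner
        self._cache: dict[str, dict] = {}

    def invoke(self, messages, config):
        # Stable key: hash of the canonical JSON form of the messages
        key = hashlib.sha256(
            json.dumps(messages, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.inner.invoke(messages, config)
        return self._cache[key]
```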
Database Optimization
Index frequently queried fields:
CREATE INDEX idx_eval_case_id ON execution_runs(eval_case_id);
CREATE INDEX idx_generated_at ON reports(generated_at);
Best Practices
1. Start with Mock Provider
Develop test cases without API costs:
providers:
- type: mock
2. Use Default Configurations
Share common settings across tests:
# .judge_llm.defaults.yaml
providers:
  - type: gemini
    model: gemini-2.0-flash-exp
evaluators:
  - type: response_evaluator
  - type: cost_evaluator
3. Version Control Everything
project/
├── .judge_llm.defaults.yaml # Project defaults
├── tests/
│ ├── regression_suite.yaml # Test configs
│ ├── regression_tests.json # Test cases
│ └── safety_tests.json
├── evaluators/ # Custom evaluators
└── .env.example # Environment template
4. Monitor Costs
Use database reporter to track spending:
sqlite3 results.db "
SELECT
DATE(generated_at) as date,
SUM(total_cost) as daily_cost
FROM reports
GROUP BY DATE(generated_at)
ORDER BY date DESC
"
5. Incremental Testing
Build up test suites gradually:
- Start with basic happy path tests
- Add edge cases
- Add error scenarios
- Add performance benchmarks
- Add safety validations
Next Steps
- Quick Start - Get started in 5 minutes
- Configuration Guide - Deep dive into configuration
- Examples - Learn by example
- Custom Components - Extend the framework
- Python API - Programmatic usage