JSON Reporter

Export evaluation results as structured JSON for programmatic analysis, data pipelines, and tool integration.

Overview

The JSON Reporter writes evaluation results to a structured JSON file, making them easy to consume in automation, data analysis, and integrations with other tools.

Key Features:

  • Machine-readable structured format
  • Complete evaluation data export
  • Easy to parse and process
  • Version control friendly
  • Schema-consistent output

Configuration

Basic Configuration

reporters:
  - type: json
    output_path: ./results.json

Python API

from judge_llm import evaluate

report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "gemini", "agent_id": "test"}],
    evaluators=[{"type": "response_evaluator"}],
    reporters=[{"type": "json", "output_path": "./results.json"}],
)

Output Format

Top-Level Structure

{
  "total_cost": 0.0234,
  "total_time": 12.45,
  "success_rate": 0.8,
  "overall_success": true,
  "summary": {
    "total_executions": 10,
    "successful_executions": 8,
    "failed_executions": 2
  },
  "test_cases": [...]
}

Test Case Structure

{
  "eval_id": "test_001",
  "agent_id": "my_agent",
  "provider_type": "gemini",
  "passed": true,
  "cost": 0.00234,
  "time_taken": 1.234,
  "evaluation_results": [...],
  "conversation_history": [...]
}
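Given the two structures above, failing cases can be pulled out with a few lines of standard-library Python. The field names come from the examples; the inline `data` below is hypothetical sample data standing in for a real `results.json`:

```python
import json

# Hypothetical report matching the structures documented above.
data = json.loads("""
{
  "success_rate": 0.5,
  "test_cases": [
    {"eval_id": "test_001", "passed": true,  "cost": 0.002},
    {"eval_id": "test_002", "passed": false, "cost": 0.004}
  ]
}
""")

# Collect the eval_ids of every failed test case.
failed = [tc["eval_id"] for tc in data["test_cases"] if not tc["passed"]]
print(failed)  # → ['test_002']
```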

Use Cases

Programmatic Analysis

import json

# Load results
with open("results.json") as f:
    data = json.load(f)

# Analyze pass rate
print(f"Success rate: {data['success_rate'] * 100:.1f}%")

# Find expensive tests
expensive = [
    tc for tc in data['test_cases']
    if tc['cost'] > 0.01
]
print(f"Expensive tests: {len(expensive)}")

# Calculate average latency
avg_latency = sum(tc['time_taken'] for tc in data['test_cases']) / len(data['test_cases'])
print(f"Average latency: {avg_latency:.2f}s")

CI/CD Integration

# .github/workflows/eval.yml
- name: Run Evaluations
  run: judge-llm run --config test.yaml --report json --output results.json

- name: Analyze Results
  run: python analyze_results.py results.json

- name: Fail if Low Pass Rate
  run: |
    pass_rate=$(jq '.success_rate' results.json)
    if (( $(echo "$pass_rate < 0.9" | bc -l) )); then
      echo "Pass rate too low: $pass_rate"
      exit 1
    fi
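The workflow above invokes an `analyze_results.py` script whose contents are not shown. A minimal sketch, assuming only the top-level fields documented earlier, might look like this (the exit-code convention is an assumption, chosen so a failed run fails the CI step):

```python
import json
import sys


def analyze(path):
    """Print a short summary of a JSON report and return an exit code."""
    with open(path) as f:
        data = json.load(f)
    summary = data["summary"]
    print(f"Executions: {summary['total_executions']} "
          f"(failed: {summary['failed_executions']})")
    print(f"Success rate: {data['success_rate'] * 100:.1f}%")
    # Non-zero exit code makes the CI step fail when the run did not succeed.
    return 0 if data["overall_success"] else 1


# Usage: python analyze_results.py results.json
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(analyze(sys.argv[1]))
```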

Data Pipelines

# ETL pipeline
import json
import pandas as pd

# Load JSON results
with open("results.json") as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data['test_cases'])

# Analyze
print(df.groupby('provider_type')['cost'].sum())
print(df[['eval_id', 'passed', 'time_taken']])

# Export to CSV
df.to_csv("analysis.csv", index=False)

Version Control

# Track results over time
git add results.json
git commit -m "Evaluation results for v2.0"

# Compare with previous
git diff HEAD~1 results.json
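Beyond a raw `git diff`, the same comparison can be done programmatically, for example to flag a regression between two committed result files (the file names in the commented usage are illustrative):

```python
import json


def success_rate(path):
    """Read the top-level success_rate from a JSON report file."""
    with open(path) as f:
        return json.load(f)["success_rate"]


# Illustrative usage, comparing the current report against a previous
# checkout of results.json saved under a hypothetical name:
# delta = success_rate("results.json") - success_rate("results.prev.json")
# if delta < 0:
#     print(f"Success rate dropped by {-delta:.1%}")
```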

Best Practices

1. Timestamped Outputs

reporters:
  - type: json
    output_path: ./results/eval-{{timestamp}}.json

2. Combine with Other Reporters

reporters:
  - type: console   # Monitor progress
  - type: json      # Export for analysis
    output_path: ./results.json

3. Structured Storage

results/
├── 2025-01-15/
│   ├── morning-eval.json
│   └── afternoon-eval.json
└── 2025-01-16/
    └── daily-eval.json

4. Schema Validation

import json
import jsonschema

# Define expected schema
schema = {
    "type": "object",
    "required": ["total_cost", "success_rate", "test_cases"],
    "properties": {
        "success_rate": {"type": "number", "minimum": 0, "maximum": 1}
    }
}

# Validate
with open("results.json") as f:
    data = json.load(f)
jsonschema.validate(data, schema)