JSON Reporter

Export evaluation results as structured JSON for programmatic analysis, data pipelines, and tool integration.

Overview

The JSON Reporter writes evaluation results to a structured JSON file, making them easy to consume in automation, data analysis, and integrations with other tools.

Key Features:

  • Machine-readable structured format
  • Complete evaluation data export
  • Easy to parse and process
  • Version control friendly
  • Schema-consistent output

Configuration

Basic Configuration

reporters:
  - type: json
    output_path: ./results.json

Python API

from judge_llm import evaluate

report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "gemini", "agent_id": "test"}],
    evaluators=[{"type": "response_evaluator"}],
    reporters=[{"type": "json", "output_path": "./results.json"}],
)

Output Format

Top-Level Structure

{
  "total_cost": 0.0234,
  "total_time": 12.45,
  "success_rate": 0.8,
  "overall_success": true,
  "summary": {
    "total_executions": 10,
    "successful_executions": 8,
    "failed_executions": 2
  },
  "test_cases": [...]
}

Test Case Structure

{
  "eval_id": "test_001",
  "agent_id": "my_agent",
  "provider_type": "gemini",
  "passed": true,
  "cost": 0.00234,
  "time_taken": 1.234,
  "evaluation_results": [...],
  "conversation_history": [...]
}
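Given the two structures above, failing cases can be pulled out with a few lines of standard-library Python. The field names come from the examples; the inline `data` below is hypothetical sample data standing in for a real `results.json`:

```python
import json

# Hypothetical report matching the structures documented above.
data = json.loads("""
{
  "success_rate": 0.5,
  "test_cases": [
    {"eval_id": "test_001", "passed": true,  "cost": 0.002},
    {"eval_id": "test_002", "passed": false, "cost": 0.004}
  ]
}
""")

# Collect the eval_ids of every failed test case.
failed = [tc["eval_id"] for tc in data["test_cases"] if not tc["passed"]]
print(failed)  # → ['test_002']
```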

Use Cases

Programmatic Analysis

import json

# Load results
with open("results.json") as f:
    data = json.load(f)

# Analyze pass rate
print(f"Success rate: {data['success_rate'] * 100:.1f}%")

# Find expensive tests
expensive = [
    tc for tc in data['test_cases']
    if tc['cost'] > 0.01
]
print(f"Expensive tests: {len(expensive)}")

# Calculate average latency
avg_latency = sum(tc['time_taken'] for tc in data['test_cases']) / len(data['test_cases'])
print(f"Average latency: {avg_latency:.2f}s")

CI/CD Integration

# .github/workflows/eval.yml
- name: Run Evaluations
  run: judge-llm run --config test.yaml --report json --output results.json

- name: Analyze Results
  run: python analyze_results.py results.json

- name: Fail if Low Pass Rate
  run: |
    pass_rate=$(jq '.success_rate' results.json)
    if (( $(echo "$pass_rate < 0.9" | bc -l) )); then
      echo "Pass rate too low: $pass_rate"
      exit 1
    fi
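The workflow above invokes an `analyze_results.py` script whose contents are not shown. A minimal sketch, assuming only the top-level fields documented earlier, might look like this (the exit-code convention is an assumption, chosen so a failed run fails the CI step):

```python
import json
import sys


def analyze(path):
    """Print a short summary of a JSON report and return an exit code."""
    with open(path) as f:
        data = json.load(f)
    summary = data["summary"]
    print(f"Executions: {summary['total_executions']} "
          f"(failed: {summary['failed_executions']})")
    print(f"Success rate: {data['success_rate'] * 100:.1f}%")
    # Non-zero exit code makes the CI step fail when the run did not succeed.
    return 0 if data["overall_success"] else 1


# Usage: python analyze_results.py results.json
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(analyze(sys.argv[1]))
```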

Data Pipelines

# ETL pipeline
import json
import pandas as pd

# Load JSON results
with open("results.json") as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data['test_cases'])

# Analyze
print(df.groupby('provider_type')['cost'].sum())
print(df[['eval_id', 'passed', 'time_taken']])

# Export to CSV
df.to_csv("analysis.csv", index=False)

Version Control

# Track results over time
git add results.json
git commit -m "Evaluation results for v2.0"

# Compare with previous
git diff HEAD~1 results.json
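Beyond a raw `git diff`, the same comparison can be done programmatically, for example to flag a regression between two committed result files (the file names in the commented usage are illustrative):

```python
import json


def success_rate(path):
    """Read the top-level success_rate from a JSON report file."""
    with open(path) as f:
        return json.load(f)["success_rate"]


# Illustrative usage, comparing the current report against a previous
# checkout of results.json saved under a hypothetical name:
# delta = success_rate("results.json") - success_rate("results.prev.json")
# if delta < 0:
#     print(f"Success rate dropped by {-delta:.1%}")
```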

Best Practices

1. Timestamped Outputs

reporters:
  - type: json
    output_path: ./results/eval-{{timestamp}}.json

2. Combine with Other Reporters

reporters:
  - type: console   # Monitor progress
  - type: json      # Export for analysis
    output_path: ./results.json

3. Structured Storage

results/
├── 2025-01-15/
│   ├── morning-eval.json
│   └── afternoon-eval.json
└── 2025-01-16/
    └── daily-eval.json

4. Schema Validation

import json
import jsonschema

# Define expected schema
schema = {
    "type": "object",
    "required": ["total_cost", "success_rate", "test_cases"],
    "properties": {
        "success_rate": {"type": "number", "minimum": 0, "maximum": 1}
    }
}

# Validate
with open("results.json") as f:
    data = json.load(f)
jsonschema.validate(data, schema)