Database Tracking
Store, query, and analyze evaluation results using the Database Reporter for comprehensive historical tracking and trend analysis.
Overview
Location: examples/06-database-reporter/
Difficulty: Intermediate
What You'll Learn:
- Storing evaluation results in SQLite database
- Querying historical data with SQL
- Tracking trends over time
- Analyzing costs and performance
- Detecting regressions
- Building custom analytics
Why Database Reporter?
Problems with File-Based Reporting
JSON/HTML reporters:
- ❌ Each run overwrites previous results
- ❌ Can't compare across runs
- ❌ Manual work to track trends
- ❌ Difficult to analyze aggregate data
Solution: Database Reporter
- ✅ Append results (preserve history)
- ✅ Query with SQL
- ✅ Track trends automatically
- ✅ Compare runs easily
- ✅ Build dashboards
Capabilities
- 📊 Store all evaluation results in structured SQLite database
- 📈 Track trends over time (cost, performance, success rates)
- 🔍 Query historical data using SQL
- 💰 Analyze costs by provider, model, and test case
- 🎯 Monitor evaluator performance across runs
- 📉 Detect regressions by comparing current vs. historical results
- 🏗️ Build dashboards using database as backend
Files
06-database-reporter/
├── config.yaml # Configuration with database reporter
├── math_questions.evalset.json # Test cases
├── run_evaluation.py # Evaluation runner
├── query_results.py # Analysis queries
├── run.sh # Convenience script
└── README.md # Instructions
Quick Start
1. Set up API key
export GEMINI_API_KEY="your-api-key-here"
2. Run evaluation
cd examples/06-database-reporter
./run.sh
# Or: python run_evaluation.py
# Or: judge-llm run --config config.yaml
3. Query results
python query_results.py
Configuration
config.yaml
agent:
  name: math_tutor_agent
  description: "Math tutoring agent"
  num_runs: 2  # Run each test twice for consistency check

dataset:
  loader: local_file
  paths:
    - ./math_questions.evalset.json

providers:
  - type: gemini
    agent_id: math_tutor_agent
    model: gemini-2.0-flash-exp
    temperature: 0.7

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85

reporters:
  - type: console   # Real-time output
  - type: database  # SQLite storage
    db_path: "../../reports/06-database-reporter/results.db"
  - type: html      # Visual report
    output_path: "../../reports/06-database-reporter/report.html"
Database Reporter Configuration
reporters:
  - type: database
    db_path: "path/to/results.db"
    # Optional settings:
    # auto_create_tables: true  # Create tables if they don't exist (default)
    # append_results: true      # Append instead of overwriting (default)
Key Points:
- Creates SQLite database at specified path
- Auto-creates tables on first run
- Appends new results (preserves history)
- No data loss between runs
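The append-only pattern can be pictured with plain `sqlite3`. This is a sketch, not the reporter's actual code: `record_report` is a hypothetical stand-in for what the reporter does internally, using a trimmed-down `reports` table.

```python
import os
import sqlite3
import tempfile

def record_report(db_path, report_id, total_cost):
    # Create-if-missing, then append -- the same pattern the reporter follows.
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reports (
            report_id TEXT PRIMARY KEY,
            generated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            total_cost REAL
        )
    """)
    conn.execute(
        "INSERT INTO reports (report_id, total_cost) VALUES (?, ?)",
        (report_id, total_cost),
    )
    conn.commit()
    conn.close()

# Two separate "runs" against the same file: history accumulates.
path = os.path.join(tempfile.mkdtemp(), "results.db")
record_report(path, "run-1", 0.0150)
record_report(path, "run-2", 0.0180)

conn = sqlite3.connect(path)
count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
conn.close()
print(count)
```

Because every run is a plain `INSERT`, the second run does not disturb the first row, which is exactly why historical queries stay possible.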
Database Schema
Table 1: reports
High-level summary of each evaluation run:
CREATE TABLE reports (
    report_id TEXT PRIMARY KEY,
    generated_at TIMESTAMP,
    total_cost REAL,
    total_time REAL,
    success_rate REAL,
    overall_success BOOLEAN,
    summary_json TEXT,
    metadata TEXT
);
Fields:
- report_id: Unique identifier for each run
- generated_at: Timestamp of the evaluation
- total_cost: Sum of all test case costs
- total_time: Total execution time
- success_rate: Percentage of passing tests
- overall_success: True if all tests passed
- summary_json: Full summary as JSON
- metadata: Additional context
Table 2: execution_runs
Individual test case executions:
CREATE TABLE execution_runs (
    execution_id TEXT PRIMARY KEY,
    report_id TEXT,
    eval_case_id TEXT,
    provider_type TEXT,
    cost REAL,
    time_taken REAL,
    token_usage INTEGER,
    overall_success BOOLEAN,
    provider_result TEXT,
    eval_case_data TEXT,
    FOREIGN KEY (report_id) REFERENCES reports(report_id)
);
Fields:
- execution_id: Unique execution identifier
- eval_case_id: Test case identifier
- provider_type: Which provider was used (gemini, openai, etc.)
- cost: Cost for this execution
- time_taken: Execution time in seconds
- token_usage: Number of tokens used
- overall_success: Pass/fail status
- provider_result: Full provider response as JSON
- eval_case_data: Test case data as JSON
Table 3: evaluator_results
Detailed evaluator outcomes:
CREATE TABLE evaluator_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    execution_id TEXT,
    evaluator_name TEXT,
    evaluator_type TEXT,
    success BOOLEAN,
    passed BOOLEAN,
    score REAL,
    threshold REAL,
    result_data TEXT,
    error_message TEXT,
    FOREIGN KEY (execution_id) REFERENCES execution_runs(execution_id)
);
Fields:
- evaluator_name: Evaluator identifier
- evaluator_type: Type of evaluator
- success: Whether the evaluator ran successfully
- passed: Whether the test passed this evaluator
- score: Numeric score (0-1)
- threshold: Required threshold
- result_data: Full result as JSON
- error_message: Error message if the evaluator failed
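The three tables link through `report_id` and `execution_id`, so a join can trace any evaluator result back to its test case and run. The sketch below builds an in-memory database with abbreviated versions of the tables (only the columns needed for the join) and pulls out every failing evaluator result with its run context:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (
        report_id TEXT PRIMARY KEY,
        generated_at TIMESTAMP);
    CREATE TABLE execution_runs (
        execution_id TEXT PRIMARY KEY,
        report_id TEXT,
        eval_case_id TEXT,
        FOREIGN KEY (report_id) REFERENCES reports(report_id));
    CREATE TABLE evaluator_results (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        execution_id TEXT,
        evaluator_name TEXT,
        passed BOOLEAN,
        score REAL,
        FOREIGN KEY (execution_id) REFERENCES execution_runs(execution_id));
""")
# One run, one execution, one failing evaluator result (sample data).
conn.execute("INSERT INTO reports VALUES ('r1', '2025-10-19 00:00:00')")
conn.execute("INSERT INTO execution_runs VALUES ('e1', 'r1', 'math_001')")
conn.execute(
    "INSERT INTO evaluator_results (execution_id, evaluator_name, passed, score) "
    "VALUES ('e1', 'response_evaluator', 0, 0.62)"
)

# Join child -> parent -> grandparent to give each failure its context.
failures = conn.execute("""
    SELECT r.generated_at, x.eval_case_id, v.evaluator_name, v.score
    FROM evaluator_results v
    JOIN execution_runs x ON x.execution_id = v.execution_id
    JOIN reports r ON r.report_id = x.report_id
    WHERE v.passed = 0
""").fetchall()
print(failures)
```

The same join works unchanged against the real `results.db`, since the key columns match the schema above.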
Example Queries
1. Overall Summary
SELECT
    COUNT(*) as total_runs,
    SUM(total_cost) as cumulative_cost,
    AVG(total_time) as avg_time,
    AVG(success_rate) as avg_success_rate
FROM reports;
Output:
total_runs: 5
cumulative_cost: $0.0450
avg_time: 8.5s
avg_success_rate: 92.5%
2. Results by Test Case
SELECT
    eval_case_id,
    COUNT(*) as runs,
    AVG(cost) as avg_cost,
    AVG(time_taken) as avg_time,
    SUM(CASE WHEN overall_success = 1 THEN 1 ELSE 0 END) as passed,
    COUNT(*) - SUM(CASE WHEN overall_success = 1 THEN 1 ELSE 0 END) as failed
FROM execution_runs
GROUP BY eval_case_id
ORDER BY avg_cost DESC;
Output:
eval_case_id | runs | avg_cost | avg_time | passed | failed
--------------------------|------|----------|----------|--------|-------
complex_math_problem | 10 | $0.0025 | 2.5s | 8 | 2
simple_addition | 10 | $0.0008 | 1.1s | 10 | 0
word_problem | 10 | $0.0018 | 1.8s | 9 | 1
3. Evaluator Performance
SELECT
    evaluator_name,
    evaluator_type,
    COUNT(*) as total,
    SUM(passed) as passed,
    AVG(score) as avg_score,
    AVG(threshold) as threshold
FROM evaluator_results
GROUP BY evaluator_name, evaluator_type
ORDER BY avg_score DESC;
Output:
evaluator_name | type | total | passed | avg_score | threshold
--------------------|-------------------|-------|--------|-----------|----------
response_evaluator | response | 30 | 27 | 0.87 | 0.85
cost_evaluator | cost | 30 | 30 | 1.00 | N/A
latency_evaluator | latency | 30 | 28 | 0.93 | N/A
4. Daily Cost Trends
SELECT
    DATE(generated_at) as date,
    SUM(total_cost) as daily_cost,
    COUNT(*) as runs,
    AVG(success_rate) as avg_success_rate
FROM reports
GROUP BY DATE(generated_at)
ORDER BY date DESC;
Output:
date | daily_cost | runs | avg_success_rate
-----------|------------|------|------------------
2025-10-19 | $0.0180 | 3 | 95.0%
2025-10-18 | $0.0150 | 2 | 90.0%
2025-10-17 | $0.0120 | 2 | 92.5%
5. Provider Comparison
SELECT
    provider_type,
    COUNT(*) as runs,
    AVG(cost) as avg_cost,
    AVG(time_taken) as avg_time,
    AVG(CASE WHEN overall_success = 1 THEN 1.0 ELSE 0.0 END) as success_rate
FROM execution_runs
GROUP BY provider_type;
Output:
provider_type | runs | avg_cost | avg_time | success_rate
--------------|------|----------|----------|-------------
gemini | 20 | $0.0015 | 1.5s | 95.0%
openai | 15 | $0.0025 | 1.2s | 93.3%
anthropic | 10 | $0.0035 | 2.1s | 100.0%
6. Detect Regressions
WITH recent_runs AS (
    SELECT AVG(success_rate) as recent_success
    FROM reports
    WHERE generated_at >= datetime('now', '-7 days')
),
historical_runs AS (
    SELECT AVG(success_rate) as historical_success
    FROM reports
    WHERE generated_at < datetime('now', '-7 days')
)
SELECT
    recent_success,
    historical_success,
    (recent_success - historical_success) as change,
    CASE
        WHEN (recent_success - historical_success) < -5 THEN 'REGRESSION'
        WHEN (recent_success - historical_success) > 5 THEN 'IMPROVEMENT'
        ELSE 'STABLE'
    END as status
FROM recent_runs, historical_runs;
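The same comparison can act as an automated gate. Note the ±5 bands assume `success_rate` is stored as a percentage (0-100), so they mean five percentage points. The sketch below runs the regression query against synthetic data (two healthy older runs, one weak recent run) and derives an exit code a CI job could use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (report_id TEXT, generated_at TIMESTAMP, success_rate REAL)"
)
# Synthetic history: two healthy older runs, one weak recent run.
conn.execute("INSERT INTO reports VALUES ('r1', datetime('now', '-30 days'), 95.0)")
conn.execute("INSERT INTO reports VALUES ('r2', datetime('now', '-14 days'), 95.0)")
conn.execute("INSERT INTO reports VALUES ('r3', datetime('now', '-1 days'), 85.0)")

query = """
WITH recent_runs AS (
    SELECT AVG(success_rate) AS recent_success
    FROM reports
    WHERE generated_at >= datetime('now', '-7 days')
),
historical_runs AS (
    SELECT AVG(success_rate) AS historical_success
    FROM reports
    WHERE generated_at < datetime('now', '-7 days')
)
SELECT
    recent_success,
    historical_success,
    recent_success - historical_success AS change,
    CASE
        WHEN recent_success - historical_success < -5 THEN 'REGRESSION'
        WHEN recent_success - historical_success > 5 THEN 'IMPROVEMENT'
        ELSE 'STABLE'
    END AS status
FROM recent_runs, historical_runs
"""
recent, historical, change, status = conn.execute(query).fetchone()
print(status, change)
exit_code = 1 if status == "REGRESSION" else 0  # usable as a CI gate
```

With the sample data, the recent average (85.0) sits ten points below the historical average (95.0), so the query flags a regression.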
Running Multiple Times
Build historical data by running repeatedly:
# Run 1
python run_evaluation.py
# Wait, make changes, run again
python run_evaluation.py
# Run 3
python run_evaluation.py
# Analyze trends
python query_results.py
Each run appends to the database - no data loss!
Using the Database
Method 1: SQLite CLI
# Open database
sqlite3 ../../reports/06-database-reporter/results.db
# List tables
.tables
# Describe table
.schema execution_runs
# Run query
SELECT * FROM reports;
# Export to CSV
.mode csv
.output results.csv
SELECT * FROM execution_runs;
.quit
Method 2: Python
import sqlite3
import pandas as pd

# Connect
conn = sqlite3.connect('results.db')

# Query
df = pd.read_sql_query("""
    SELECT eval_case_id, AVG(cost) as avg_cost
    FROM execution_runs
    GROUP BY eval_case_id
""", conn)

print(df)
conn.close()
Method 3: GUI Tools
Open with any SQLite tool:
- DB Browser for SQLite (Free)
- DataGrip
- DBeaver (Free)
- VS Code SQLite extension
Use Cases
1. Cost Tracking
Monitor spending per test case:
sqlite3 results.db \
  "SELECT eval_case_id, AVG(cost) as avg_cost
   FROM execution_runs
   GROUP BY eval_case_id
   ORDER BY avg_cost DESC
   LIMIT 10"
Use: Identify expensive tests to optimize
2. Performance Monitoring
Track execution times:
sqlite3 results.db \
  "SELECT eval_case_id, AVG(time_taken) as avg_time
   FROM execution_runs
   GROUP BY eval_case_id
   ORDER BY avg_time DESC
   LIMIT 10"
Use: Find slow tests
3. Success Rate Trends
Monitor quality over time:
sqlite3 results.db \
  "SELECT DATE(generated_at) as date, success_rate
   FROM reports
   ORDER BY generated_at DESC
   LIMIT 30"
Use: Detect regressions early
4. Evaluator Analysis
Which evaluators fail most?
sqlite3 results.db \
  "SELECT evaluator_name,
          COUNT(*) as total,
          SUM(passed) as passed,
          ROUND(100.0 * SUM(passed) / COUNT(*), 2) as pass_rate
   FROM evaluator_results
   GROUP BY evaluator_name"
Use: Tune evaluation criteria
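The four queries above differ only in which column they rank by, so they collapse into one small helper. This is an illustrative sketch (`top_cases` is a hypothetical function, not part of judge_llm); note the metric name becomes a SQL identifier, which cannot be a bound parameter, hence the whitelist:

```python
import sqlite3

# Whitelist: the metric is interpolated as a SQL identifier, so it must
# come from a fixed set rather than user input.
ALLOWED_METRICS = {"cost", "time_taken"}

def top_cases(conn, metric="cost", limit=10):
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {metric}")
    query = (
        f"SELECT eval_case_id, AVG({metric}) AS avg_value "
        "FROM execution_runs "
        "GROUP BY eval_case_id "
        "ORDER BY avg_value DESC "
        "LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()

# Demo on an in-memory table with just the relevant columns:
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE execution_runs (execution_id TEXT, eval_case_id TEXT, "
    "cost REAL, time_taken REAL)"
)
conn.executemany(
    "INSERT INTO execution_runs VALUES (?, ?, ?, ?)",
    [
        ("e1", "complex_math_problem", 0.0025, 2.5),
        ("e2", "simple_addition", 0.0008, 1.1),
        ("e3", "word_problem", 0.0018, 1.8),
    ],
)
most_expensive = top_cases(conn, "cost", limit=2)
slowest = top_cases(conn, "time_taken", limit=1)
print(most_expensive)
```

Pointing the connection at the real `results.db` instead of `:memory:` gives the same rankings over your actual history.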
Advanced: Build a Dashboard
Using Streamlit
# dashboard.py
import streamlit as st
import sqlite3
import pandas as pd
import plotly.express as px

st.title("LLM Evaluation Dashboard")

conn = sqlite3.connect('results.db')

# Cost over time
cost_df = pd.read_sql_query("""
    SELECT DATE(generated_at) as date, SUM(total_cost) as cost
    FROM reports
    GROUP BY DATE(generated_at)
    ORDER BY date
""", conn)
fig = px.line(cost_df, x='date', y='cost', title='Daily Cost')
st.plotly_chart(fig)

# Success rate
success_df = pd.read_sql_query("""
    SELECT DATE(generated_at) as date, AVG(success_rate) as rate
    FROM reports
    GROUP BY DATE(generated_at)
    ORDER BY date
""", conn)
fig = px.line(success_df, x='date', y='rate', title='Success Rate')
st.plotly_chart(fig)

conn.close()
Run:
streamlit run dashboard.py
Expected Output
After running evaluation:
📊 Math Tutor Evaluation Results
================================================================================
📋 OVERALL SUMMARY
--------------------------------------------------------------------------------
Total evaluation reports: 3
Total test executions: 24
Successful executions: 22 (91.7%)
Cost metrics:
Total cost: $0.0450
Average cost per run: $0.0019
Time metrics:
Total time: 45.30s
Average time per run: 1.89s
📊 BY TEST CASE
--------------------------------------------------------------------------------
eval_case_id | runs | avg_cost | avg_time | passed | failed
-----------------------|------|----------|----------|--------|-------
math_001_addition | 6 | $0.0008 | 1.1s | 6 | 0
math_002_subtraction | 6 | $0.0010 | 1.3s | 5 | 1
math_003_multiplication| 6 | $0.0012 | 1.5s | 6 | 0
math_004_division | 6 | $0.0015 | 2.2s | 5 | 1
🎯 BY EVALUATOR
--------------------------------------------------------------------------------
evaluator_name | total | avg_score | pass_rate
---------------------|-------|-----------|----------
response_evaluator | 24 | 0.87 | 91.7%
cost_evaluator | 24 | 1.00 | 100.0%
latency_evaluator | 24 | 0.95 | 95.8%
Best Practices
1. Regular Runs
Run evaluations regularly to build trends:
# Cron job: daily at midnight
0 0 * * * cd /path/to/judge_llm && python run_evaluation.py
2. Backup Database
# Backup before major changes
cp results.db results.backup.$(date +%Y%m%d).db
3. Archive Old Data
-- Archive runs older than 90 days
DELETE FROM reports WHERE generated_at < datetime('now', '-90 days');
VACUUM; -- Reclaim space
4. Index for Performance
-- Add indexes for common queries
CREATE INDEX idx_generated_at ON reports(generated_at);
CREATE INDEX idx_eval_case_id ON execution_runs(eval_case_id);
CREATE INDEX idx_provider_type ON execution_runs(provider_type);
Next Steps
After mastering database tracking:
- Run multiple times to build historical data
- Try different providers to compare performance
- Create custom queries for your specific analytics
- Build a dashboard (Streamlit, Flask, Grafana)
- Set up alerts for regressions or cost spikes