# Comparing Models

Learn how to compare LLMs side by side to find the best one for your use case.
## A/B Testing Models

Compare two models on the same test cases:
```yaml
# ab-test.yaml
dataset:
  loader: local_file
  paths:
    - ./tests.json

providers:
  - type: gemini
    agent_id: gemini_flash
    model: gemini-2.0-flash-exp
  - type: openai
    agent_id: gpt4
    model: gpt-4

evaluators:
  - type: response_evaluator
  - type: cost_evaluator
    max_cost: 0.05
  - type: latency_evaluator
    max_latency: 5.0

reporters:
  - type: console
  - type: html
    output_path: ./model-comparison.html
```
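The config above reads test cases from `./tests.json`. The exact schema depends on how your dataset is defined; purely as an illustration (these field names are assumptions, not the loader's documented format), such a file might look like:

```json
[
  {
    "eval_id": "capital-france",
    "prompt": "What is the capital of France?",
    "expected": "Paris"
  }
]
```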
Run the comparison:

```bash
judge-llm run --config ab-test.yaml
```
## Understanding the Results

The HTML report shows:

- **Per-Model Success Rates**: which model has higher accuracy
- **Cost Comparison**: which model is more cost-effective
- **Latency Comparison**: which model is faster
- **Test-by-Test Breakdown**: where each model excels
### Example Output

```text
Summary:
  gemini_flash:
    Total Tests: 10
    Passed: 9
    Failed: 1
    Success Rate: 90.0%
    Total Cost: $0.0087
    Avg Latency: 1.2s

  gpt4:
    Total Tests: 10
    Passed: 10
    Failed: 0
    Success Rate: 100.0%
    Total Cost: $0.0234
    Avg Latency: 2.1s
```
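Totals alone can mislead when pass counts differ; cost per passing test puts the two models on a comparable footing. A quick check using the figures above:

```python
# Cost per passing test, computed from the summary figures above
results = {
    "gemini_flash": {"passed": 9, "total_cost": 0.0087},
    "gpt4": {"passed": 10, "total_cost": 0.0234},
}
for name, r in results.items():
    per_pass = r["total_cost"] / r["passed"]
    print(f"{name}: ${per_pass:.4f} per passing test")
```

Even counting its one failure, gemini_flash comes out roughly 2.4x cheaper per passing test here; whether that justifies the lower success rate depends on your use case.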
## Multi-Model Comparison

Compare three or more models by listing additional providers:

```yaml
providers:
  - type: gemini
    agent_id: gemini_flash
    model: gemini-2.0-flash-exp
  - type: gemini
    agent_id: gemini_pro
    model: gemini-pro
  - type: openai
    agent_id: gpt4
    model: gpt-4
  - type: anthropic
    agent_id: claude
    model: claude-3-5-sonnet-20241022
```
## Analyzing Results with Python

Use the Python API for deeper analysis:

```python
from judge_llm import evaluate
import pandas as pd

report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[
        {"type": "gemini", "agent_id": "gemini"},
        {"type": "openai", "agent_id": "gpt4"},
        {"type": "anthropic", "agent_id": "claude"},
    ],
    evaluators=[{"type": "response_evaluator"}],
)

# Convert to DataFrame
data = []
for tc in report.test_cases:
    data.append({
        "test_id": tc.eval_id,
        "provider": tc.agent_id,
        "passed": tc.passed,
        "cost": tc.cost,
        "latency": tc.time_taken,
    })
df = pd.DataFrame(data)

# Aggregate by provider
summary = df.groupby("provider").agg({
    "passed": ["count", "sum", "mean"],
    "cost": ["sum", "mean"],
    "latency": "mean",
}).round(4)

print("\nModel Comparison:")
print(summary)

# Find the winner on each axis
best_accuracy = summary[("passed", "mean")].idxmax()
best_cost = summary[("cost", "sum")].idxmin()
best_speed = summary[("latency", "mean")].idxmin()

print(f"\nBest Accuracy: {best_accuracy}")
print(f"Best Cost: {best_cost}")
print(f"Best Speed: {best_speed}")
```
## Cost vs. Quality Tradeoff

Visualize cost against quality with a scatter plot:

```python
import matplotlib.pyplot as plt

# Group by provider
providers = df.groupby("provider").agg({
    "passed": "mean",
    "cost": "sum",
})

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(providers["cost"], providers["passed"], s=100)
for provider in providers.index:
    plt.annotate(
        provider,
        (providers.loc[provider, "cost"], providers.loc[provider, "passed"]),
        xytext=(5, 5),
        textcoords="offset points",
    )
plt.xlabel("Total Cost ($)")
plt.ylabel("Success Rate")
plt.title("Model Comparison: Cost vs. Quality")
plt.grid(True, alpha=0.3)
plt.show()
```
## Testing Model Variants

Compare different configurations of the same model:

```yaml
providers:
  - type: gemini
    agent_id: gemini_temp_0
    model: gemini-2.0-flash-exp
    temperature: 0.0  # Deterministic
  - type: gemini
    agent_id: gemini_temp_0.7
    model: gemini-2.0-flash-exp
    temperature: 0.7  # Creative
  - type: gemini
    agent_id: gemini_temp_1.0
    model: gemini-2.0-flash-exp
    temperature: 1.0  # Very creative
```
## Winner Selection Strategies

### 1. Accuracy First

Choose the model with the highest success rate:

```python
winner = df.groupby("provider")["passed"].mean().idxmax()
print(f"Most accurate model: {winner}")
```
### 2. Cost First

Choose the cheapest model that meets a minimum quality bar:

```python
MIN_SUCCESS_RATE = 0.9

# Filter by minimum quality
qualified = df.groupby("provider").agg({
    "passed": "mean",
    "cost": "sum",
})
qualified = qualified[qualified["passed"] >= MIN_SUCCESS_RATE]

# Find the cheapest among the qualified
winner = qualified["cost"].idxmin()
print(f"Best value model: {winner}")
```
### 3. Speed First

Choose the model with the lowest average latency:

```python
winner = df.groupby("provider")["latency"].mean().idxmin()
print(f"Fastest model: {winner}")
```
### 4. Balanced Scoring

Weight multiple factors:

```python
WEIGHTS = {
    "accuracy": 0.5,
    "cost": 0.3,
    "speed": 0.2,
}

providers = df.groupby("provider").agg({
    "passed": "mean",
    "cost": "sum",
    "latency": "mean",
})

# Normalize each factor to a 0-1 scale where higher is better
# (the most expensive / slowest provider scores 0 on that axis)
providers["accuracy_norm"] = providers["passed"]
providers["cost_norm"] = 1 - (providers["cost"] / providers["cost"].max())
providers["speed_norm"] = 1 - (providers["latency"] / providers["latency"].max())

# Calculate the weighted score
providers["score"] = (
    providers["accuracy_norm"] * WEIGHTS["accuracy"]
    + providers["cost_norm"] * WEIGHTS["cost"]
    + providers["speed_norm"] * WEIGHTS["speed"]
)

winner = providers["score"].idxmax()
print(f"Best balanced model: {winner}")
print(f"Score: {providers.loc[winner, 'score']:.3f}")
```
## Storing Results for Tracking

Use the database reporter to track comparisons over time:

```yaml
reporters:
  - type: console
  - type: database
    db_path: ./model_comparisons.db
```

Query the historical data:

```sql
SELECT
    agent_id,
    AVG(passed) AS avg_success_rate,
    AVG(cost) AS avg_cost,
    AVG(time_taken) AS avg_latency,
    COUNT(*) AS total_tests
FROM test_cases
GROUP BY agent_id
ORDER BY avg_success_rate DESC;
```
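The same aggregation can be run from Python with the standard-library `sqlite3` module. The snippet below builds an in-memory table matching the columns the query above assumes (the database reporter's actual schema may differ), so the query shape can be tested without real results:

```python
import sqlite3

# In-memory demo using the columns the SQL above assumes
# (the database reporter's actual schema may differ)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_cases (agent_id TEXT, passed INTEGER, cost REAL, time_taken REAL)"
)
conn.executemany(
    "INSERT INTO test_cases VALUES (?, ?, ?, ?)",
    [
        ("gemini_flash", 1, 0.001, 1.2),
        ("gemini_flash", 0, 0.001, 1.1),
        ("gpt4", 1, 0.002, 2.0),
        ("gpt4", 1, 0.003, 2.2),
    ],
)
summary = list(conn.execute(
    "SELECT agent_id, AVG(passed), AVG(cost), AVG(time_taken), COUNT(*) "
    "FROM test_cases GROUP BY agent_id ORDER BY AVG(passed) DESC"
))
for agent_id, rate, cost, latency, total in summary:
    print(f"{agent_id}: {rate:.0%} pass rate, ${cost:.4f} avg cost, {latency:.2f}s avg, n={total}")
```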
## Best Practices

### 1. Use the Same Test Set

Always compare models on identical test cases:

```yaml
dataset:
  loader: local_file
  paths:
    - ./tests.json  # Same file for all models
```
### 2. Multiple Runs

Run each comparison several times so that differences aren't just run-to-run noise:

```bash
for i in {1..5}; do
  judge-llm run --config ab-test.yaml --report database --db-path results.db
done
```
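Repeated runs matter because a single run's success rate is a noisy estimate. Given per-run success rates (the numbers below are hypothetical), the standard library's `statistics` module gives a quick spread estimate:

```python
import statistics

# Hypothetical success rates from five repeated runs of the same config
run_success_rates = [0.90, 0.85, 0.90, 0.95, 0.90]

mean = statistics.mean(run_success_rates)
spread = statistics.stdev(run_success_rates)  # sample standard deviation
print(f"Success rate: {mean:.2f} +/- {spread:.3f}")
```

If two models' ranges overlap heavily, collect more runs (or more test cases) before declaring a winner.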
### 3. Diverse Test Cases

Include a variety of scenarios:
- Simple factual questions
- Complex reasoning tasks
- Edge cases
- Multi-turn conversations
### 4. Control Variables

Keep all other factors constant:
- Same temperature (unless testing temperature)
- Same system prompt
- Same evaluators
- Same test order
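For example, when the goal is to compare models rather than sampling behavior, pin the temperature to the same value for every provider. A sketch, assuming each provider accepts a `temperature` field as in the variants example above:

```yaml
providers:
  - type: gemini
    agent_id: gemini_flash
    model: gemini-2.0-flash-exp
    temperature: 0.0  # identical across providers
  - type: openai
    agent_id: gpt4
    model: gpt-4
    temperature: 0.0  # identical across providers
```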
## Next Steps
- Configuration Guide - Advanced configuration options
- Database Reporter - Track results over time
- Cost Evaluator - Budget control
- Latency Evaluator - Performance testing
## Related Documentation
- Python API - Programmatic analysis
- HTML Reporter - Visual comparisons
- Examples - More comparison examples