# Next Steps

You've completed the quick start tutorials! Here's where to go next.
## 🎉 What You've Learned

- ✅ Installed Judge LLM and set up API keys
- ✅ Created test cases and configuration files
- ✅ Ran your first evaluation
- ✅ Used the Python API programmatically
- ✅ Compared multiple models
- ✅ Analyzed results and made decisions
## 📚 Continue Learning

### Deep Dive into Features

#### 1. Master Configuration

Learn all configuration options and patterns:

- Configuration Guide - Complete reference
- Default Configs - Reusable defaults
- Environment Variables - Secure config
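The environment-variables pattern keeps secrets like `GEMINI_API_KEY` out of config files. A minimal sketch of the idea (the `require_env` helper is illustrative, not part of Judge LLM):

```python
import os


def require_env(name: str) -> str:
    """Read a required secret from the environment, failing fast if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# For this demo only: provide a placeholder value if the key isn't set.
os.environ.setdefault("GEMINI_API_KEY", "demo-key")

# Resolve the API key once, before building any provider configuration.
api_key = require_env("GEMINI_API_KEY")
```

Failing fast at startup beats a confusing authentication error halfway through an evaluation run.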
#### 2. Explore Evaluators

Understand the different evaluation methods:

- Response Evaluator - LLM-as-judge
- Trajectory Evaluator - Reasoning process
- Cost Evaluator - Budget control
- Latency Evaluator - Performance
- Custom Evaluators - Build your own
#### 3. Master Reporters

Learn about the different output formats:

- Console Reporter - Terminal output
- HTML Reporter - Interactive reports
- JSON Reporter - Machine-readable
- Database Reporter - SQLite storage
- Custom Reporters - Build your own
### Explore Examples

Work through real-world examples:

- 01-gemini-agent - Basic setup
- 02-default-config - Using defaults
- 03-custom-evaluator - Custom evaluators
- 04-safety-evaluation - Multi-turn
- 05-config-override - Overrides
- 06-database-tracking - Historical tracking
- CSV Reporter - See Custom Reporters
- Component Registration - See Default Configs
## 🚀 Common Next Projects

### Project 1: Set Up CI/CD Testing

Add LLM evaluation to your CI/CD pipeline.

**GitHub Actions example:**
```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Judge LLM
        run: pip install judge-llm
      - name: Run Evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: judge-llm run --config tests/eval.yaml
      - name: Upload Report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: report.html
```
### Project 2: Build a Custom Evaluator

Create domain-specific validation:
```python
# evaluators/my_evaluator.py
from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.core.models import EvaluationResult


class MyDomainEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        # Your custom logic here
        content = response.get("content", "")

        # Check your domain-specific rules
        is_valid = self._check_domain_rules(content)

        return EvaluationResult(
            evaluator_type="my_domain",
            passed=is_valid,
            score=1.0 if is_valid else 0.0,
            reason="Valid" if is_valid else "Invalid",
        )

    def _check_domain_rules(self, content):
        # Implement your rules
        return True
**Learn more:** Custom Evaluators Guide
### Project 3: Historical Tracking

Track evaluation results over time:

```yaml
# config.yaml
reporters:
  - type: console
  - type: database
    db_path: ./history.db
```
Query trends:

```sql
SELECT
  DATE(timestamp) AS date,
  AVG(success_rate) AS avg_success,
  AVG(total_cost) AS avg_cost
FROM evaluation_runs
GROUP BY DATE(timestamp)
ORDER BY date DESC
LIMIT 30;
```
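The same trend query can be run from Python with the standard library. A sketch that assumes the `evaluation_runs` table has `timestamp`, `success_rate`, and `total_cost` columns (the actual schema comes from the database reporter, so check your `history.db` first); it builds an in-memory table with sample rows so it runs standalone:

```python
import sqlite3

# In-memory demo database; point this at "./history.db" for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluation_runs (timestamp TEXT, success_rate REAL, total_cost REAL)"
)
conn.executemany(
    "INSERT INTO evaluation_runs VALUES (?, ?, ?)",
    [
        ("2024-05-01 09:00:00", 0.9, 0.02),
        ("2024-05-01 17:00:00", 0.7, 0.04),
        ("2024-05-02 09:00:00", 1.0, 0.01),
    ],
)

# Daily averages, newest first.
rows = conn.execute(
    """
    SELECT DATE(timestamp) AS date,
           AVG(success_rate) AS avg_success,
           AVG(total_cost) AS avg_cost
    FROM evaluation_runs
    GROUP BY DATE(timestamp)
    ORDER BY date DESC
    LIMIT 30
    """
).fetchall()

for date, avg_success, avg_cost in rows:
    print(f"{date}: success={avg_success:.0%} cost=${avg_cost:.3f}")
```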
### Project 4: Multi-Provider Testing

Test across different providers:

```yaml
providers:
  - type: gemini
    agent_id: gemini
  - type: openai
    agent_id: openai
  - type: anthropic
    agent_id: claude

evaluators:
  - type: response_evaluator
  - type: cost_evaluator
    max_cost: 0.01

reporters:
  - type: html
    output_path: ./provider-comparison.html
```
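Under the hood, comparing providers is a per-provider aggregation over test results. A hedged sketch of that computation on plain result records (the field names below are illustrative, not Judge LLM's report schema):

```python
from collections import defaultdict

# Illustrative per-test records; in practice these come from your report data.
results = [
    {"provider": "gemini", "passed": True, "cost": 0.002},
    {"provider": "gemini", "passed": False, "cost": 0.003},
    {"provider": "openai", "passed": True, "cost": 0.004},
    {"provider": "claude", "passed": True, "cost": 0.005},
]

# Tally passes and costs per provider.
tallies = defaultdict(lambda: {"passed": 0, "total": 0, "cost": 0.0})
for r in results:
    t = tallies[r["provider"]]
    t["total"] += 1
    t["passed"] += int(r["passed"])
    t["cost"] += r["cost"]

# Reduce to the two numbers you usually compare: success rate and total cost.
summary = {
    name: {"success_rate": t["passed"] / t["total"], "total_cost": t["cost"]}
    for name, t in tallies.items()
}
```

The HTML report does this for you; the sketch just shows what the comparison boils down to if you want to script your own analysis.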
## 🔧 Advanced Topics

### Custom Component Registration

Register components globally for reuse:

```yaml
# .judge_llm.defaults.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety

reporters:
  - type: custom
    module_path: ./reporters/slack.py
    class_name: SlackReporter
    register_as: slack
```
Then use them anywhere by their registered names:

```yaml
# test.yaml
evaluators:
  - type: safety

reporters:
  - type: slack
    webhook_url: ${SLACK_WEBHOOK}
```
### Programmatic Workflows

Build automated evaluation workflows:

```python
import os
import time

import schedule

from judge_llm import evaluate


def daily_evaluation():
    """Run the daily evaluation and alert on failures."""
    report = evaluate(
        dataset={"loader": "local_file", "paths": ["./daily-tests.json"]},
        providers=[{"type": "gemini", "agent_id": "prod"}],
        evaluators=[{"type": "response_evaluator"}],
        reporters=[
            {"type": "database", "db_path": "./daily.db"},
            {"type": "slack", "webhook_url": os.getenv("SLACK_WEBHOOK")},
        ],
    )
    if not report.overall_success:
        # send_alert is a placeholder for your own notification hook.
        send_alert(f"Daily evaluation failed: {report.success_rate:.1%}")


# Schedule daily at 9 AM
schedule.every().day.at("09:00").do(daily_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)
```

**Learn more:** Python API Reference
## 📖 Reference Documentation

### Quick Links

- CLI Reference - All CLI commands
- Python API - Complete API docs
- Configuration Guide - All config options
- Evalset Format - Test case specification

### Component Documentation

For per-component details, see the evaluator and reporter guides listed under Continue Learning above.
## 🤝 Get Help

### Troubleshooting

Check the troubleshooting sections in:

- Basic Usage Guide
- CLI Reference
- Each component's documentation
### Common Issues

1. **API Key Not Found**
   - Check that the `.env` file exists
   - Verify environment variables are set
   - See Environment Variables

2. **Tests Failing**
   - Review evaluator output for failure reasons
   - Check that expected responses are correct
   - See First Evaluation

3. **High Costs**
   - Add a cost evaluator with limits
   - Use cheaper models
   - See Cost Evaluator
## 🎯 Choose Your Path

### For Application Developers

Focus on integration and automation.

### For QA Engineers

Focus on testing and validation.

### For Data Scientists

Focus on analysis and comparison.

### For DevOps Engineers

Focus on CI/CD and monitoring.
## 🚀 You're Ready!

You now have all the knowledge to:

- ✅ Evaluate LLMs effectively
- ✅ Compare models objectively
- ✅ Integrate evaluation into your workflow
- ✅ Build custom components
- ✅ Track results over time

Happy evaluating! 🎉

Need more help? Check out the complete documentation or explore the examples.