Next Steps

You've completed the quick start tutorials! Here's where to go next.

🎉 What You've Learned

✅ Installed Judge LLM and set up API keys
✅ Created test cases and configuration files
✅ Ran your first evaluation
✅ Used the Python API programmatically
✅ Compared multiple models
✅ Analyzed results and made decisions

📚 Continue Learning

Deep Dive into Features

1. Master Configuration

Learn all configuration options and patterns:
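As a starting point, a minimal configuration file might look like the sketch below. It is assembled from fields shown elsewhere on this page (`dataset`, `providers`, `evaluators`, `reporters`); adjust paths and types to your setup.

```yaml
# eval.yaml -- minimal sketch; field names follow the examples on this page
dataset:
  loader: local_file
  paths:
    - ./tests/cases.json

providers:
  - type: gemini
    agent_id: gemini

evaluators:
  - type: response_evaluator

reporters:
  - type: console
  - type: html
    output_path: ./report.html
```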

2. Explore Evaluators

Understand different evaluation methods:

3. Master Reporters

Learn about different output formats:

Explore Examples

Work through real-world examples:

  1. 01-gemini-agent - Basic setup
  2. 02-default-config - Using defaults
  3. 03-custom-evaluator - Custom evaluators
  4. 04-safety-evaluation - Multi-turn
  5. 05-config-override - Overrides
  6. 06-database-tracking - Historical tracking
  7. CSV Reporter - See Custom Reporters
  8. Component Registration - See Default Configs

View All Examples

🚀 Common Next Projects

Project 1: Set Up CI/CD Testing

Add LLM evaluation to your CI/CD pipeline:

GitHub Actions Example:

```yaml
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Judge LLM
        run: pip install judge-llm

      - name: Run Evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: judge-llm run --config tests/eval.yaml

      - name: Upload Report
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-report
          path: report.html
```

Learn More: CLI Reference

Project 2: Build Custom Evaluator

Create domain-specific validation:

```python
# evaluators/my_evaluator.py
from judge_llm.evaluators.base import BaseEvaluator
from judge_llm.core.models import EvaluationResult

class MyDomainEvaluator(BaseEvaluator):
    def evaluate(self, test_case, response):
        # Your custom logic here
        content = response.get("content", "")

        # Check your domain-specific rules
        is_valid = self._check_domain_rules(content)

        return EvaluationResult(
            evaluator_type="my_domain",
            passed=is_valid,
            score=1.0 if is_valid else 0.0,
            reason="Valid" if is_valid else "Invalid",
        )

    def _check_domain_rules(self, content):
        # Implement your rules
        return True
```

Learn More: Custom Evaluators Guide

Project 3: Historical Tracking

Track evaluation results over time:

```yaml
# config.yaml
reporters:
  - type: console
  - type: database
    db_path: ./history.db
```

Query trends:

```sql
SELECT
  DATE(timestamp) AS date,
  AVG(success_rate) AS avg_success,
  AVG(total_cost) AS avg_cost
FROM evaluation_runs
GROUP BY DATE(timestamp)
ORDER BY date DESC
LIMIT 30;
```
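If you prefer to pull the trend into Python, the standard `sqlite3` module can run the same query. The sketch below assumes the `evaluation_runs` table and columns used above; it builds an in-memory copy with sample rows for illustration, whereas in practice you would connect to `./history.db` directly.

```python
import sqlite3

# In practice: conn = sqlite3.connect("./history.db")
# Here, an in-memory copy with the same assumed schema, for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE evaluation_runs (
           timestamp TEXT,
           success_rate REAL,
           total_cost REAL
       )"""
)
conn.executemany(
    "INSERT INTO evaluation_runs VALUES (?, ?, ?)",
    [
        ("2024-05-01T09:00:00", 0.9, 0.02),
        ("2024-05-01T21:00:00", 0.8, 0.04),
        ("2024-05-02T09:00:00", 1.0, 0.03),
    ],
)

# Same daily-trend query as above
rows = conn.execute(
    """SELECT DATE(timestamp) AS date,
              AVG(success_rate) AS avg_success,
              AVG(total_cost) AS avg_cost
       FROM evaluation_runs
       GROUP BY DATE(timestamp)
       ORDER BY date DESC
       LIMIT 30"""
).fetchall()

for date, avg_success, avg_cost in rows:
    print(f"{date}: success={avg_success:.0%} cost=${avg_cost:.4f}")
```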

Learn More: Database Reporter

Project 4: Multi-Provider Testing

Test across different providers:

```yaml
providers:
  - type: gemini
    agent_id: gemini
  - type: openai
    agent_id: openai
  - type: anthropic
    agent_id: claude

evaluators:
  - type: response_evaluator
  - type: cost_evaluator
    max_cost: 0.01

reporters:
  - type: html
    output_path: ./provider-comparison.html
```

Learn More: Comparing Models

🔧 Advanced Topics

Custom Component Registration

Register components globally for reuse:

```yaml
# .judge_llm.defaults.yaml
evaluators:
  - type: custom
    module_path: ./evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety

reporters:
  - type: custom
    module_path: ./reporters/slack.py
    class_name: SlackReporter
    register_as: slack
```

Use everywhere:

```yaml
# test.yaml
evaluators:
  - type: safety

reporters:
  - type: slack
    webhook_url: ${SLACK_WEBHOOK}
```

Learn More: Default Configs

Programmatic Workflows

Build automated evaluation workflows:

```python
import os
import time

import schedule

from judge_llm import evaluate

def daily_evaluation():
    """Run daily evaluation and alert on failures"""
    report = evaluate(
        dataset={"loader": "local_file", "paths": ["./daily-tests.json"]},
        providers=[{"type": "gemini", "agent_id": "prod"}],
        evaluators=[{"type": "response_evaluator"}],
        reporters=[
            {"type": "database", "db_path": "./daily.db"},
            {"type": "slack", "webhook_url": os.getenv("SLACK_WEBHOOK")},
        ],
    )

    if not report.overall_success:
        # send_alert is your own notification helper (e.g. email or pager)
        send_alert(f"Daily evaluation failed: {report.success_rate:.1%}")

# Schedule daily at 9 AM
schedule.every().day.at("09:00").do(daily_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)
```

Learn More: Python API Reference

📖 Reference Documentation

Component Documentation

Evaluators:

Reporters:

🤝 Get Help

Troubleshooting

Check the troubleshooting sections in:

Common Issues

  1. API Key Not Found

  2. Tests Failing

    • Review evaluator output for reasons
    • Check expected responses are correct
    • See First Evaluation
  3. High Costs

    • Add cost evaluator with limits
    • Use cheaper models
    • See Cost Evaluator
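For the first issue, the provider key must be visible in the environment of the process running the evaluation. For Gemini that is the `GEMINI_API_KEY` variable used in the CI example above; a quick check (a sketch — substitute your real key):

```shell
# Export the key for the current shell session (replace with your real key)
export GEMINI_API_KEY="your-key-here"

# Confirm the key is visible to child processes
python3 -c 'import os; print("set" if os.getenv("GEMINI_API_KEY") else "missing")'
```

To persist the key across sessions, add the `export` line to your shell profile (e.g. `~/.bashrc`), then rerun the evaluation.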

🎯 Choose Your Path

For Application Developers

Focus on integration and automation:

  1. Python API Reference
  2. Custom Evaluators
  3. Database Reporter

For QA Engineers

Focus on testing and validation:

  1. CLI Reference
  2. Evalset Format
  3. HTML Reporter

For Data Scientists

Focus on analysis and comparison:

  1. Comparing Models
  2. Python API
  3. Database Reporter

For DevOps Engineers

Focus on CI/CD and monitoring:

  1. CLI Reference
  2. Environment Variables
  3. Database Reporter

🚀 You're Ready!

You now have all the knowledge to:

  • ✅ Evaluate LLMs effectively
  • ✅ Compare models objectively
  • ✅ Integrate into your workflow
  • ✅ Build custom components
  • ✅ Track results over time

Happy evaluating! 🎉


Need more help? Check out the complete documentation or explore examples.