LLM Judge Evaluator
Use an LLM as a judge to evaluate response quality, relevance, hallucination, and factuality.
Overview
The LLM Judge Evaluator sends responses to an LLM (like Gemini) that acts as a judge to assess various quality dimensions. This enables sophisticated evaluation that goes beyond simple text matching.
Type: `llm_judge_evaluator`
Key Features:
- Multiple evaluation types (relevance, hallucination, quality, factuality)
- Comprehensive multi-dimensional scoring
- Custom evaluation prompts
- Configurable scoring thresholds
- Detailed reasoning and feedback
Quick Start
```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: comprehensive
      min_score: 3.0
```
Configuration
Basic Configuration
```yaml
evaluators:
  - type: llm_judge_evaluator
```
Full Configuration
```yaml
evaluators:
  - type: llm_judge_evaluator
    enabled: true
    config:
      model: gemini-2.0-flash          # Judge model
      api_key: ${GOOGLE_API_KEY}       # API key (or use env var)
      evaluation_type: comprehensive   # Type of evaluation
      min_score: 3.0                   # Minimum score to pass (1-5 scale)
      temperature: 0.0                 # LLM temperature (0 for consistency)
      max_retries: 2                   # Retries on parse failure
      custom_prompt: null              # Custom evaluation prompt
```
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| `model` | string | `gemini-2.0-flash` | LLM model for judging |
| `api_key` | string | (env var) | API key for the model |
| `evaluation_type` | string | `comprehensive` | Type of evaluation |
| `min_score` | float | 3.0 | Minimum score to pass (1-5) |
| `temperature` | float | 0.0 | LLM temperature |
| `max_retries` | int | 2 | Max retries on parse failure |
| `custom_prompt` | string | null | Custom evaluation prompt |
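The `api_key` option falls back to an environment variable when it is not set explicitly. A minimal sketch of that lookup, assuming a simple config-first-then-environment precedence (`resolve_api_key` is an illustrative helper, not part of the evaluator's API):

```python
import os

# Sketch of the api_key fallback: prefer an explicit config value,
# otherwise read GOOGLE_API_KEY from the environment.
# The helper name and exact precedence are assumptions for illustration.
def resolve_api_key(config_value=None):
    return config_value or os.environ.get("GOOGLE_API_KEY")

os.environ["GOOGLE_API_KEY"] = "env-key"  # e.g. set via `export GOOGLE_API_KEY=...`
from_env = resolve_api_key(None)          # no config value: falls back to the environment
explicit = resolve_api_key("config-key")  # explicit config value wins
```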
Evaluation Types
Relevance
Assesses how well the response addresses the user's query:
```yaml
config:
  evaluation_type: relevance
```
Scoring (1-5):
- 5: Highly relevant, fully addresses the query
- 4: Mostly relevant, minor gaps
- 3: Somewhat relevant, addresses part of query
- 2: Mostly irrelevant, tangentially related
- 1: Completely irrelevant
Hallucination Detection
Checks for fabricated or false information:
```yaml
config:
  evaluation_type: hallucination
```
Scoring (1-5):
- 5: No hallucination, all accurate
- 4: Minimal, very minor imprecisions
- 3: Minor hallucination, mostly accurate
- 2: Significant hallucination
- 1: Severe hallucination, multiple false claims
Additional output: List of hallucinated claims
Quality Assessment
Evaluates overall response quality:
```yaml
config:
  evaluation_type: quality
```
Scoring dimensions:
- Helpfulness
- Clarity
- Completeness
- Coherence
- Tone
Factuality
Checks factual accuracy against expected response:
```yaml
config:
  evaluation_type: factuality
```
Additional output: List of factual errors found
Comprehensive (Default)
Multi-dimensional assessment covering all aspects:
```yaml
config:
  evaluation_type: comprehensive
```
Returns:
- Overall score (1-5)
- Dimension scores: relevance, accuracy, completeness, clarity, helpfulness
- Strengths and suggested improvements
Usage Examples
Example 1: Comprehensive Evaluation
```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: comprehensive
      min_score: 3.5
```
Use case: General-purpose quality assessment.
Example 2: Hallucination Detection
```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: hallucination
      min_score: 4.0  # Strict - minimal hallucination allowed
```
Use case: Fact-critical applications, medical/legal content.
Example 3: Custom Evaluation Prompt
```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      custom_prompt: |
        You are evaluating a customer support response.
        User Query: {user_query}
        Response: {response}
        Expected Response: {expected_response}
        Rate the response on professionalism and helpfulness (1-5).
        Respond with ONLY JSON: {{"score": <1-5>, "reasoning": "<explanation>"}}
```
Use case: Domain-specific evaluation criteria.
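The `{user_query}`, `{response}`, and `{expected_response}` placeholders are filled in before the prompt is sent, and the judge's reply is expected to be bare JSON. A rough sketch of that round trip with the LLM call simulated (the variable names and the simulated reply are illustrative, not the evaluator's internals):

```python
import json

CUSTOM_PROMPT = """You are evaluating a customer support response.
User Query: {user_query}
Response: {response}
Expected Response: {expected_response}
Rate the response on professionalism and helpfulness (1-5).
Respond with ONLY JSON: {{"score": <1-5>, "reasoning": "<explanation>"}}"""

# Fill the placeholders; the doubled braces {{...}} survive as literal JSON braces.
prompt = CUSTOM_PROMPT.format(
    user_query="How do I reset my password?",
    response="Go to Settings > Security and click 'Reset password'.",
    expected_response="Explain the password reset flow.",
)

# Simulated judge reply; in practice this string comes from the LLM call.
raw_reply = '{"score": 4, "reasoning": "Professional and direct."}'
verdict = json.loads(raw_reply)
passed = verdict["score"] >= 3.0  # compare against min_score
```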
Example 4: Strict Quality Gate
```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: quality
      min_score: 4.0
      temperature: 0.0  # Consistent scoring
```
Use case: Production quality gates.
Example 5: Per-Case Override
```json
{
  "eval_id": "factual_qa_001",
  "evaluator_config": {
    "LLMJudgeEvaluator": {
      "evaluation_type": "factuality",
      "min_score": 4.5
    }
  }
}
```
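Assuming the per-case block is shallow-merged over the YAML defaults (an inference from the example, not documented behaviour), the effective config for this case would be:

```python
# Base config from YAML, plus the per-case override from the test case above.
base = {"evaluation_type": "comprehensive", "min_score": 3.0}
override = {"evaluation_type": "factuality", "min_score": 4.5}

# Shallow merge: per-case keys win (assumed semantics, for illustration).
effective = {**base, **override}
print(effective)  # {'evaluation_type': 'factuality', 'min_score': 4.5}
```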
Evaluation Result
The evaluator returns detailed results:
```python
{
    "evaluator_name": "LLMJudgeEvaluator",
    "evaluator_type": "llm_judge_evaluator",
    "passed": True,
    "score": 0.75,     # Normalized to 0-1
    "threshold": 0.5,  # Normalized threshold
    "success": True,
    "details": {
        "evaluation_type": "comprehensive",
        "model": "gemini-2.0-flash",
        "min_score": 3.0,
        "average_score": 4.0,  # Original 1-5 scale
        "normalized_score": 0.75,
        "all_invocations_passed": True,
        "num_invocations": 1,
        "invocation_results": [
            {
                "invocation": 0,
                "score": 4.0,
                "passed": True,
                "evaluation_type": "comprehensive",
                "llm_response": {
                    "overall_score": 4,
                    "dimensions": {
                        "relevance": 5,
                        "accuracy": 4,
                        "completeness": 4,
                        "clarity": 4,
                        "helpfulness": 4
                    },
                    "reasoning": "Response addresses the query well...",
                    "strengths": ["Clear explanation", "Accurate facts"],
                    "improvements": ["Could provide more examples"]
                }
            }
        ]
    }
}
```
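The 0-1 values in the result are consistent with a linear rescaling of the 1-5 judge scale, `(score - 1) / 4`; this formula is inferred from the example numbers (4.0 becomes 0.75, and a `min_score` of 3.0 becomes a threshold of 0.5), not taken from source code:

```python
def normalize(score: float) -> float:
    """Map a 1-5 judge score onto the 0-1 scale used in the result.

    Linear rescaling inferred from the example values, not from source code.
    """
    return (score - 1.0) / 4.0

print(normalize(4.0))  # average_score 4.0 -> score 0.75
print(normalize(3.0))  # min_score 3.0 -> threshold 0.5
```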
When to Use
Use LLM Judge When:
- Evaluating subjective quality (tone, helpfulness)
- Detecting hallucinations and false claims
- Assessing relevance to user intent
- Testing open-ended or creative responses
- Needing nuanced feedback beyond text matching
Don't Use When:
- Exact output matching is required
- Cost is a primary concern (each eval = API call)
- Evaluating structured data (JSON, code syntax)
- Simple pass/fail criteria exist
Best Practices
1. Use Low Temperature
```yaml
config:
  temperature: 0.0  # Consistent, reproducible scores
```
2. Set Appropriate Thresholds
```yaml
# Strict quality (production)
min_score: 4.0

# Moderate quality (development)
min_score: 3.0

# Lenient (early prototyping)
min_score: 2.5
```
3. Combine with Other Evaluators
```yaml
evaluators:
  - type: response_evaluator    # Fast, cheap check
    config:
      similarity_threshold: 0.6
  - type: llm_judge_evaluator   # Detailed quality check
    config:
      evaluation_type: comprehensive
      min_score: 3.5
```
4. Use Specific Evaluation Types
```yaml
# For Q&A systems
evaluation_type: factuality

# For chatbots
evaluation_type: quality

# For search/retrieval
evaluation_type: relevance
```
5. Custom Prompts for Domains
```yaml
config:
  custom_prompt: |
    Evaluate this medical advice response...
    Check for: accuracy, safety warnings, clarity
    ...
```
Cost Considerations
Each evaluation makes an API call to the judge LLM:
| Scenario | API Calls | Cost Impact |
|---|---|---|
| 100 test cases, 1 turn each | 100 | Low |
| 100 test cases, 5 turns each | 500 | Medium |
| 1000 test cases | 1000+ | High |
Tips to reduce cost:
- Use a smaller judge model (e.g. `gemini-2.0-flash`)
- Pre-filter with cheaper evaluators first
- Use sampling for large test suites
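For large suites, sampling a fixed-size subset before invoking the judge caps the number of API calls. A minimal sketch (the case IDs and sample size are made up for illustration):

```python
import random

# Hypothetical test suite of 1000 case IDs.
test_cases = [f"case_{i:04d}" for i in range(1000)]

# Seed for a reproducible sample across runs, then judge only the subset.
random.seed(42)
sampled = random.sample(test_cases, k=100)

print(len(sampled))  # 100 judge calls instead of 1000
```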
Troubleshooting
JSON Parse Errors
Issue: "Failed to parse JSON response"
Solutions:
- Increase `max_retries`
- Use a lower temperature
- Simplify custom prompts
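The `max_retries` option re-asks the judge when its reply does not parse as JSON. A sketch of that behaviour with a toy judge that fails once (illustrative only; the evaluator's real retry logic may differ):

```python
import json

def parse_with_retries(call_llm, prompt: str, max_retries: int = 2) -> dict:
    """Call the judge until its reply parses as JSON, up to max_retries extra tries."""
    last_error = None
    for _ in range(max_retries + 1):
        reply = call_llm(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as exc:
            last_error = exc  # malformed reply: ask again
    raise ValueError(f"Failed to parse JSON response: {last_error}")

# Toy judge: first reply is malformed, second is valid JSON.
replies = iter(["not json", '{"score": 5, "reasoning": "ok"}'])
result = parse_with_retries(lambda _prompt: next(replies), "rate this", max_retries=2)
```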
Inconsistent Scores
Issue: Same content gets different scores
Solutions:
```yaml
config:
  temperature: 0.0  # Deterministic output
  max_retries: 3    # Retry on failures
```
API Key Errors
Issue: "LLM client not available"
Solutions:
```shell
export GOOGLE_API_KEY=your-key-here
```
Low Scores Despite Good Responses
Issue: Judge is too strict
Solutions:
- Lower the `min_score` threshold
- Review expected responses for realism
- Use a more lenient evaluation type