
LLM Judge Evaluator

Use an LLM as a judge to evaluate response quality, relevance, hallucination, and factuality.

Overview

The LLM Judge Evaluator sends responses to an LLM (like Gemini) that acts as a judge to assess various quality dimensions. This enables sophisticated evaluation that goes beyond simple text matching.

Type: llm_judge_evaluator

Key Features:

  • Multiple evaluation types (relevance, hallucination, quality, factuality)
  • Comprehensive multi-dimensional scoring
  • Custom evaluation prompts
  • Configurable scoring thresholds
  • Detailed reasoning and feedback

Quick Start

```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: comprehensive
      min_score: 3.0
```

Configuration

Basic Configuration

```yaml
evaluators:
  - type: llm_judge_evaluator
```

Full Configuration

```yaml
evaluators:
  - type: llm_judge_evaluator
    enabled: true
    config:
      model: gemini-2.0-flash        # Judge model
      api_key: ${GOOGLE_API_KEY}     # API key (or use env var)
      evaluation_type: comprehensive # Type of evaluation
      min_score: 3.0                 # Minimum score to pass (1-5 scale)
      temperature: 0.0               # LLM temperature (0 for consistency)
      max_retries: 2                 # Retries on parse failure
      custom_prompt: null            # Custom evaluation prompt
```

Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| `model` | string | `gemini-2.0-flash` | LLM model for judging |
| `api_key` | string | (env var) | API key for the model |
| `evaluation_type` | string | `comprehensive` | Type of evaluation |
| `min_score` | float | `3.0` | Minimum score to pass (1-5) |
| `temperature` | float | `0.0` | LLM temperature |
| `max_retries` | int | `2` | Max retries on parse failure |
| `custom_prompt` | string | `null` | Custom evaluation prompt |

Evaluation Types

Relevance

Assesses how well the response addresses the user's query:

```yaml
config:
  evaluation_type: relevance
```

Scoring (1-5):

  • 5: Highly relevant, fully addresses the query
  • 4: Mostly relevant, minor gaps
  • 3: Somewhat relevant, addresses part of query
  • 2: Mostly irrelevant, tangentially related
  • 1: Completely irrelevant

Hallucination Detection

Checks for fabricated or false information:

```yaml
config:
  evaluation_type: hallucination
```

Scoring (1-5):

  • 5: No hallucination, all accurate
  • 4: Minimal, very minor imprecisions
  • 3: Minor hallucination, mostly accurate
  • 2: Significant hallucination
  • 1: Severe hallucination, multiple false claims

Additional output: List of hallucinated claims

Quality Assessment

Evaluates overall response quality:

```yaml
config:
  evaluation_type: quality
```

Scoring dimensions:

  • Helpfulness
  • Clarity
  • Completeness
  • Coherence
  • Tone

Factuality

Checks factual accuracy against expected response:

```yaml
config:
  evaluation_type: factuality
```

Additional output: List of factual errors found

Comprehensive (Default)

Multi-dimensional assessment covering all aspects:

```yaml
config:
  evaluation_type: comprehensive
```

Returns:

  • Overall score (1-5)
  • Dimension scores: relevance, accuracy, completeness, clarity, helpfulness
  • Strengths and suggested improvements

Usage Examples

Example 1: Comprehensive Evaluation

```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: comprehensive
      min_score: 3.5
```

Use case: General-purpose quality assessment.

Example 2: Hallucination Detection

```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: hallucination
      min_score: 4.0 # Strict - minimal hallucination allowed
```

Use case: Fact-critical applications, medical/legal content.

Example 3: Custom Evaluation Prompt

```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      custom_prompt: |
        You are evaluating a customer support response.

        User Query: {user_query}
        Response: {response}
        Expected Response: {expected_response}

        Rate the response on professionalism and helpfulness (1-5).

        Respond with ONLY JSON: {{"score": <1-5>, "reasoning": "<explanation>"}}
```

Use case: Domain-specific evaluation criteria.
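
As a hedged sketch of how a custom prompt like the one above might be filled in and the judge's reply parsed (`build_prompt` and `parse_judgment` are illustrative names, not part of the evaluator's API):

```python
import json

# Illustrative template mirroring the custom_prompt placeholders above.
# The doubled {{...}} braces survive str.format as literal JSON braces.
TEMPLATE = (
    "You are evaluating a customer support response.\n\n"
    "User Query: {user_query}\n"
    "Response: {response}\n"
    "Expected Response: {expected_response}\n\n"
    "Rate the response on professionalism and helpfulness (1-5).\n\n"
    'Respond with ONLY JSON: {{"score": <1-5>, "reasoning": "<explanation>"}}'
)

def build_prompt(user_query: str, response: str, expected_response: str) -> str:
    return TEMPLATE.format(
        user_query=user_query,
        response=response,
        expected_response=expected_response,
    )

def parse_judgment(raw: str) -> dict:
    """Parse the judge's JSON reply, tolerating chatter around the object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in judge reply")
    return json.loads(raw[start:end + 1])

prompt = build_prompt("How do I reset my password?",
                      "Click 'Forgot password' on the login page.",
                      "Use the 'Forgot password' link.")
judgment = parse_judgment('Sure! {"score": 4, "reasoning": "Professional and clear."}')
```

Asking for "ONLY JSON" and still scanning for the outermost braces is a belt-and-suspenders choice: judge models occasionally wrap the object in prose despite instructions.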

Example 4: Strict Quality Gate

```yaml
evaluators:
  - type: llm_judge_evaluator
    config:
      evaluation_type: quality
      min_score: 4.0
      temperature: 0.0 # Consistent scoring
```

Use case: Production quality gates.

Example 5: Per-Case Override

```json
{
  "eval_id": "factual_qa_001",
  "evaluator_config": {
    "LLMJudgeEvaluator": {
      "evaluation_type": "factuality",
      "min_score": 4.5
    }
  }
}
```

Evaluation Result

The evaluator returns detailed results:

```python
{
    "evaluator_name": "LLMJudgeEvaluator",
    "evaluator_type": "llm_judge_evaluator",
    "passed": True,
    "score": 0.75,     # Normalized to 0-1
    "threshold": 0.5,  # Normalized threshold
    "success": True,
    "details": {
        "evaluation_type": "comprehensive",
        "model": "gemini-2.0-flash",
        "min_score": 3.0,
        "average_score": 4.0,  # Original 1-5 scale
        "normalized_score": 0.75,
        "all_invocations_passed": True,
        "num_invocations": 1,
        "invocation_results": [
            {
                "invocation": 0,
                "score": 4.0,
                "passed": True,
                "evaluation_type": "comprehensive",
                "llm_response": {
                    "overall_score": 4,
                    "dimensions": {
                        "relevance": 5,
                        "accuracy": 4,
                        "completeness": 4,
                        "clarity": 4,
                        "helpfulness": 4
                    },
                    "reasoning": "Response addresses the query well...",
                    "strengths": ["Clear explanation", "Accurate facts"],
                    "improvements": ["Could provide more examples"]
                }
            }
        ]
    }
}
```
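
The normalization implied by the sample above appears to map the 1-5 scale linearly onto 0-1 (an inference from the numbers shown, not a documented formula): an `average_score` of 4.0 becomes a `score` of 0.75, and a `min_score` of 3.0 becomes a `threshold` of 0.5.

```python
# Apparent 1-5 -> 0-1 normalization, inferred from the sample result:
# (score - 1) / 4 maps 1 -> 0.0 and 5 -> 1.0.
def normalize(score: float) -> float:
    return (score - 1.0) / 4.0

assert normalize(4.0) == 0.75  # matches the reported "score"
assert normalize(3.0) == 0.5   # matches the reported "threshold"
assert normalize(1.0) == 0.0 and normalize(5.0) == 1.0
```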

When to Use

Use LLM Judge When:

  • Evaluating subjective quality (tone, helpfulness)
  • Detecting hallucinations and false claims
  • Assessing relevance to user intent
  • Testing open-ended or creative responses
  • You need nuanced feedback beyond text matching

Don't Use When:

  • Exact output matching is required
  • Cost is a primary concern (each eval = API call)
  • Evaluating structured data (JSON, code syntax)
  • Simple pass/fail criteria exist

Best Practices

1. Use Low Temperature

```yaml
config:
  temperature: 0.0 # Consistent, reproducible scores
```

2. Set Appropriate Thresholds

```yaml
# Strict quality (production)
min_score: 4.0

# Moderate quality (development)
min_score: 3.0

# Lenient (early prototyping)
min_score: 2.5
```

3. Combine with Other Evaluators

```yaml
evaluators:
  - type: response_evaluator  # Fast, cheap check
    config:
      similarity_threshold: 0.6
  - type: llm_judge_evaluator # Detailed quality check
    config:
      evaluation_type: comprehensive
      min_score: 3.5
```

4. Use Specific Evaluation Types

```yaml
# For Q&A systems
evaluation_type: factuality

# For chatbots
evaluation_type: quality

# For search/retrieval
evaluation_type: relevance
```

5. Custom Prompts for Domains

```yaml
config:
  custom_prompt: |
    Evaluate this medical advice response...
    Check for: accuracy, safety warnings, clarity
    ...
```

Cost Considerations

Each evaluation makes an API call to the judge LLM:

| Scenario | API Calls | Cost Impact |
|---|---|---|
| 100 test cases, 1 turn each | 100 | Low |
| 100 test cases, 5 turns each | 500 | Medium |
| 1000 test cases | 1000+ | High |

Tips to reduce cost:

  • Use a smaller judge model (e.g., gemini-2.0-flash)
  • Pre-filter with cheaper evaluators first
  • Use sampling for large test suites
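
The sampling tip above can be sketched as follows (`sample_cases` is an illustrative helper, not part of the evaluator): judge only a random subset of a large suite, so cost scales with the sample size rather than the full suite.

```python
import random

def sample_cases(cases: list, fraction: float, seed: int = 0) -> list:
    """Return a reproducible random subset covering `fraction` of the suite."""
    k = max(1, round(len(cases) * fraction))
    return random.Random(seed).sample(cases, k)

suite = [f"case_{i}" for i in range(1000)]
subset = sample_cases(suite, fraction=0.1)  # ~100 judge calls instead of 1000
```

A fixed seed keeps runs comparable across CI executions; rotate the seed periodically if you want coverage to drift over the whole suite.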

Troubleshooting

JSON Parse Errors

Issue: "Failed to parse JSON response"

Solutions:

  • Increase max_retries
  • Use lower temperature
  • Simplify custom prompts
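
A hedged sketch of the retry behavior `max_retries` suggests (`call_judge` is a stand-in for the real judge call, not an actual API): retry only when the reply fails to parse as JSON.

```python
import json

def judge_with_retries(call_judge, prompt: str, max_retries: int = 2) -> dict:
    """Call the judge, retrying only on unparseable replies."""
    last_err = None
    for _ in range(max_retries + 1):  # initial attempt + max_retries retries
        try:
            return json.loads(call_judge(prompt))
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(f"judge reply unparseable after {max_retries + 1} attempts") from last_err

# Simulated flaky judge: garbage first, valid JSON on the second attempt.
replies = iter(["not json", '{"score": 5, "reasoning": "ok"}'])
result = judge_with_retries(lambda p: next(replies), "rate this")
```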

Inconsistent Scores

Issue: Same content gets different scores

Solutions:

```yaml
config:
  temperature: 0.0 # Deterministic output
  max_retries: 3   # Retry on failures
```

API Key Errors

Issue: "LLM client not available"

Solutions:

```shell
export GOOGLE_API_KEY=your-key-here
```

Low Scores Despite Good Responses

Issue: Judge is too strict

Solutions:

  • Lower min_score threshold
  • Review expected responses for realism
  • Use more lenient evaluation type