
Response Evaluator

Validate LLM response quality by comparing actual outputs against expected responses using industry-standard similarity metrics.

Overview

The Response Evaluator measures how well the agent's responses match expected outputs. It supports multiple similarity metrics and automatically selects the best available method.

Key Features:

  • Multiple similarity metrics (ROUGE, Jaccard, Recall, Exact)
  • Automatic ROUGE-1 selection when available
  • Configurable thresholds
  • Case sensitivity and whitespace normalization
  • Multi-turn conversation support
  • Detailed similarity breakdowns

Installation

The response evaluator is built in; no additional installation is required.

For ROUGE metrics (recommended):

pip install judge-llm[rouge]
# or
pip install rouge-score

Configuration

Basic Configuration

evaluators:
  - type: response_evaluator

Uses defaults: 0.8 similarity threshold, semantic matching.

Full Configuration

evaluators:
  - type: response_evaluator
    enabled: true
    config:
      similarity_threshold: 0.8    # Pass threshold (0.0-1.0)
      match_type: semantic         # exact, semantic, rouge, recall
      case_sensitive: false        # Case-sensitive matching
      normalize_whitespace: true   # Normalize whitespace/list markers

Similarity Metrics

1. ROUGE (Recommended)

Industry-standard metric for text similarity used in summarization and translation tasks.

config:
  match_type: rouge   # Or auto-selected if rouge-score is installed

How it works:

  • Calculates ROUGE-1 F1 score (unigram overlap)
  • Balances precision and recall
  • Uses stemming for better matching
  • Range: 0.0 (no match) to 1.0 (perfect match)

Example:

Expected: "The capital of France is Paris"
Actual: "Paris is the capital of France"
ROUGE-1: 1.0 (identical unigrams despite different word order)
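The example above can be reproduced with a minimal pure-Python sketch of ROUGE-1 F1 (unigram overlap with clipped counts, no stemming). The function name `rouge1_f1` is illustrative, not the library's API; the production path uses the `rouge-score` package.

```python
from collections import Counter

def rouge1_f1(expected: str, actual: str) -> float:
    """Illustrative ROUGE-1 F1: clipped unigram overlap, no stemming."""
    exp = Counter(expected.lower().split())
    act = Counter(actual.lower().split())
    overlap = sum((exp & act).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(act.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("The capital of France is Paris",
                "Paris is the capital of France"))  # → 1.0 (same unigrams)
```

Because ROUGE-1 ignores word order, two sentences built from the same unigrams score a perfect 1.0; partially overlapping sentences land strictly between 0 and 1.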

Best for: Most use cases, natural language responses

2. Semantic (Jaccard Index)

Word-based similarity using set intersection/union.

config:
  match_type: semantic

How it works:

  • Compares word sets (not order)
  • Score = intersection / union
  • Penalizes both missing and extra words equally

Example:

Expected: "red blue green"
Actual: "blue green yellow"
Jaccard: 0.5 (2 overlap / 4 unique words)
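A minimal sketch of the Jaccard computation (the helper name `jaccard` is illustrative, not the evaluator's internal API):

```python
def jaccard(expected: str, actual: str) -> float:
    """Word-set Jaccard index: |intersection| / |union| (order ignored)."""
    exp = set(expected.lower().split())
    act = set(actual.lower().split())
    if not exp and not act:
        return 1.0  # two empty strings count as a perfect match
    return len(exp & act) / len(exp | act)

print(jaccard("red blue green", "blue green yellow"))  # → 0.5
```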

Best for: When ROUGE unavailable, keyword matching

3. Recall

Word recall metric focusing on expected content coverage.

config:
  match_type: recall

How it works:

  • Score = overlap / expected_words
  • More lenient for responses with extra information
  • Doesn't penalize additional content

Example:

Expected: "red blue"
Actual: "red blue green yellow"
Recall: 1.0 (all expected words present)
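The recall formula above can be sketched as follows (the helper name `word_recall` is illustrative, not the evaluator's internal API). Note that extra words in the actual response leave the score untouched:

```python
def word_recall(expected: str, actual: str) -> float:
    """Fraction of expected words present in the actual response."""
    exp = set(expected.lower().split())
    act = set(actual.lower().split())
    if not exp:
        return 1.0  # nothing expected, trivially covered
    return len(exp & act) / len(exp)

print(word_recall("red blue", "red blue green yellow"))  # → 1.0
```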

Best for: When extra information is acceptable

4. Exact Match

Strict string equality after normalization.

config:
  match_type: exact

How it works:

  • Compares normalized strings
  • Returns 1.0 (match) or 0.0 (no match)
  • Applies case and whitespace normalization

Example:

Expected: "Hello World"
Actual: "hello world"
Score: 1.0 (if case_sensitive: false)
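A minimal sketch of exact matching with the normalization options described above (the helper name `exact_match` and its keyword parameters mirror the config keys but are illustrative, not the library's API):

```python
import re

def exact_match(expected: str, actual: str,
                case_sensitive: bool = False,
                normalize_whitespace: bool = True) -> float:
    """Return 1.0 iff the normalized strings are identical, else 0.0."""
    def norm(s: str) -> str:
        if not case_sensitive:
            s = s.lower()
        if normalize_whitespace:
            s = re.sub(r"\s+", " ", s).strip()
        return s
    return 1.0 if norm(expected) == norm(actual) else 0.0

print(exact_match("Hello World", "hello   world"))  # → 1.0
```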

Best for: Deterministic tests, exact outputs required

Metric Selection Guide

| Metric   | Pros                      | Cons                 | Use When                       |
|----------|---------------------------|----------------------|--------------------------------|
| ROUGE    | Industry standard, robust | Requires package     | Natural language (recommended) |
| Semantic | No dependencies, balanced | Simple word matching | Keyword/short responses        |
| Recall   | Lenient, allows extras    | May miss wrong info  | Extra details OK               |
| Exact    | Deterministic, fast       | Very strict          | Exact matches needed           |

Threshold Configuration

# Strict - For critical/factual responses
similarity_threshold: 0.9

# Balanced - General purpose (default)
similarity_threshold: 0.8

# Lenient - For creative/varied responses
similarity_threshold: 0.7

# Very Lenient - For exploratory/brainstorming
similarity_threshold: 0.6

Task-Specific Thresholds

# Math problems - Exact answers
config: {similarity_threshold: 0.95, match_type: exact}

# Translations - High similarity
config: {similarity_threshold: 0.85, match_type: rouge}

# Creative writing - Moderate similarity
config: {similarity_threshold: 0.6, match_type: semantic}

# Code generation - Exact match
config: {similarity_threshold: 1.0, match_type: exact}

Best Practices

1. Use ROUGE When Possible

Install rouge-score for best results:

pip install rouge-score

ROUGE is an industry-standard metric and more robust than simple word matching.

2. Set Realistic Thresholds

Start with 0.8 and adjust:

# Start here
similarity_threshold: 0.8

# Adjust based on results
# - Too many failures? Lower to 0.7
# - Need stricter? Raise to 0.9

3. Test with Mock First

Use mock provider to validate evalsets:

providers:
  - type: mock          # Returns expected responses
    agent_id: test

evaluators:
  - type: response_evaluator

You should see a 100% pass rate, which confirms your evalset format is valid.

API Reference

For implementation details, see the ResponseEvaluator API Reference.