Latency Evaluator

Ensure agent performance meets speed requirements by measuring response times.

Overview

The Latency Evaluator monitors agent response times and fails tests that exceed latency thresholds. Essential for:

  • Performance SLAs
  • Real-time applications
  • User experience requirements
  • Timeout detection

Key Features:

  • Per-test-case latency limits
  • Automatic timing from providers
  • Percentile-based thresholds
  • Performance ratio analysis

Configuration

Basic Configuration

evaluators:
  - type: latency_evaluator
    config:
      max_latency_seconds: 5.0  # 5-second limit

Full Configuration

evaluators:
  - type: latency_evaluator
    enabled: true
    config:
      max_latency_seconds: 5.0  # Maximum allowed latency
      percentile: 100           # Percentile to evaluate (100 = max)

Latency Measurement

How Latency is Tracked

Providers measure end-to-end execution time:

import time

start_time = time.time()
# Execute all conversation turns:
# call the LLM API, process responses
end_time = time.time()

latency = end_time - start_time  # Total seconds

Includes:

  • API call time
  • Network round trips
  • Response processing
  • All conversation turns
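
If you time calls yourself outside the framework, a monotonic clock such as time.perf_counter is better suited to measuring intervals than time.time. A minimal sketch (call_llm_api and prompt are hypothetical stand-ins for your own provider call, not part of this library):

import time

start = time.perf_counter()
response = call_llm_api(prompt)  # hypothetical: your actual provider call
elapsed = time.perf_counter() - start

print(f"latency: {elapsed:.3f}s")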

Typical Latencies

Provider   Model        Typical Range   Notes
Mock       mock-model   1-10ms          Instant (no API)
Gemini     flash-exp    500-2000ms      Fast, varies by load
Gemini     1.5-pro      1000-3000ms     Slower, better quality

Setting Latency Limits

By Application Type

# Real-time chat - Very strict
max_latency_seconds: 1.0

# Interactive Q&A - Moderate
max_latency_seconds: 3.0

# Background processing - Lenient
max_latency_seconds: 10.0

# Batch evaluation - Very lenient
max_latency_seconds: 30.0

By SLA Requirements

# 99th percentile must be < 2s
max_latency_seconds: 2.0
percentile: 99

# Max latency must be < 5s
max_latency_seconds: 5.0
percentile: 100
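
To see how a percentile check plays out, here is a standalone sketch in plain Python (the latency values are made up for illustration; the quantile math is independent of the evaluator):

import statistics

# Hypothetical latencies (seconds) collected from repeated runs
latencies = [1.1, 1.3, 1.4, 1.6, 1.8, 2.1, 2.3, 2.6, 3.0, 7.5]

# statistics.quantiles with n=100 returns 99 cut points; index 98 is p99
p99 = statistics.quantiles(latencies, n=100)[98]

# SLA: 99th percentile must be < 2s
print(f"p99 = {p99:.2f}s -> {'PASS' if p99 <= 2.0 else 'FAIL'}")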

Usage Examples

Example 1: Basic Performance Check

# config.yaml
dataset:
  loader: local_file
  paths: [./perf_tests.json]

providers:
  - type: gemini
    agent_id: fast_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: latency_evaluator
    config:
      max_latency_seconds: 3.0

reporters:
  - type: console

Example 2: Real-Time Requirements

evaluators:
  - type: latency_evaluator
    config:
      max_latency_seconds: 1.0  # Strict 1s limit

  - type: response_evaluator
    config:
      similarity_threshold: 0.75  # Lower quality bar for speed

Example 3: Per-Test-Case Limits

{
  "eval_id": "simple_qa_001",
  "conversation": [...],
  "evaluator_config": {
    "LatencyEvaluator": {
      "max_latency_seconds": 1.0  // Quick questions = fast
    }
  }
},
{
  "eval_id": "complex_analysis_001",
  "conversation": [...],
  "evaluator_config": {
    "LatencyEvaluator": {
      "max_latency_seconds": 10.0  // Complex = slower OK
    }
  }
}

Example 4: Provider Comparison

Compare latency across providers:

providers:
  - type: gemini
    agent_id: flash
    model: gemini-2.0-flash-exp

  - type: gemini
    agent_id: pro
    model: gemini-1.5-pro

evaluators:
  - type: latency_evaluator
    config:
      max_latency_seconds: 5.0

reporters:
  - type: html
    config:
      output_file: latency_comparison.html

Example 5: Programmatic Monitoring

from judge_llm import evaluate

report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{
        "type": "gemini",
        "agent_id": "test",
        "model": "gemini-2.0-flash-exp"
    }],
    evaluators=[{
        "type": "latency_evaluator",
        "config": {"max_latency_seconds": 3.0}
    }],
    reporters=[{"type": "console"}]
)

# Calculate percentiles
latencies = [case.time_taken for case in report.test_cases]
latencies.sort()

p50 = latencies[len(latencies) // 2]
p95 = latencies[int(len(latencies) * 0.95)]
p99 = latencies[int(len(latencies) * 0.99)]

print(f"P50: {p50:.2f}s")
print(f"P95: {p95:.2f}s")
print(f"P99: {p99:.2f}s")

# Find slow test cases
for case in report.test_cases:
    if case.time_taken > 5.0:
        print(f"Slow: {case.eval_id} ({case.time_taken:.2f}s)")

Evaluation Result

The latency evaluator returns:

{
    "evaluator_name": "LatencyEvaluator",
    "evaluator_type": "latency_evaluator",
    "passed": True,
    "score": 1.0,
    "threshold": 5.0,
    "success": True,
    "details": {
        "actual_latency_seconds": 1.234,
        "max_latency_seconds": 5.0,
        "latency_ratio": 0.247,  # 24.7% of max
        "percentile": 100
    }
}

Pass criterion: actual_latency_seconds ≤ max_latency_seconds
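
The result fields relate in a simple way. A minimal sketch of the pass logic (illustrative only; evaluate_latency is a hypothetical helper, not the library's implementation):

def evaluate_latency(actual: float, limit: float) -> dict:
    passed = actual <= limit
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "details": {
            "actual_latency_seconds": actual,
            "max_latency_seconds": limit,
            # latency_ratio < 1.0 means under budget; > 1.0 means over
            "latency_ratio": round(actual / limit, 3),
        },
    }

print(evaluate_latency(1.234, 5.0))  # latency_ratio 0.247 -> passes
print(evaluate_latency(6.5, 5.0))    # latency_ratio 1.3 -> fails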

Failed Example

{
    "passed": False,
    "score": 0.0,
    "details": {
        "actual_latency_seconds": 6.5,
        "max_latency_seconds": 5.0,
        "latency_ratio": 1.3  # 30% over limit
    }
}

Performance Optimization

1. Use Faster Models

# Slower
model: gemini-1.5-pro

# Faster (2-3x)
model: gemini-2.0-flash-exp

2. Reduce Response Length

providers:
  - type: gemini
    max_tokens: 256  # Shorter = faster

3. Use Lower Temperature

temperature: 0.3  # More focused = faster

4. Optimize Prompts

  • Shorter prompts = faster
  • Clear instructions = fewer tokens
  • Avoid unnecessary context

5. Parallel Execution

agent:
  parallel_execution: true
  max_workers: 4  # Run 4 tests concurrently

Note: Parallel execution doesn't improve individual test latency, but it does speed up the total evaluation run.

Best Practices

1. Set Realistic Limits

Account for network variability:

# Too strict (may fail due to network)
max_latency_seconds: 0.5

# Better (accounts for variance)
max_latency_seconds: 3.0

2. Test Under Load

Run performance tests with concurrent requests to simulate real load.
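
One way to do this from Python is to run several copies of the suite concurrently and inspect the worst-case latency. An illustrative sketch, assuming the evaluate() call from Example 5 is safe to invoke from multiple threads (worth verifying for your setup):

from concurrent.futures import ThreadPoolExecutor
from judge_llm import evaluate

def run_suite():
    report = evaluate(
        dataset={"loader": "local_file", "paths": ["./tests.json"]},
        providers=[{"type": "gemini", "agent_id": "load_test",
                    "model": "gemini-2.0-flash-exp"}],
        evaluators=[{"type": "latency_evaluator",
                     "config": {"max_latency_seconds": 3.0}}],
        reporters=[{"type": "console"}]
    )
    return [case.time_taken for case in report.test_cases]

# Four concurrent copies of the suite to simulate load
with ThreadPoolExecutor(max_workers=4) as pool:
    runs = list(pool.map(lambda _: run_suite(), range(4)))

latencies = sorted(t for run in runs for t in run)
print(f"Worst case under load: {latencies[-1]:.2f}s")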

3. Monitor Trends

Track latency over time to:

  • Detect performance regressions
  • Identify slow test cases
  • Optimize based on data

4. Balance Speed and Quality

evaluators:
  # Speed matters
  - type: latency_evaluator
    config: {max_latency_seconds: 2.0}

  # But quality too
  - type: response_evaluator
    config: {similarity_threshold: 0.75}

5. Use Appropriate Percentiles

# Average case
percentile: 50

# Most cases
percentile: 95

# Worst case (default)
percentile: 100

Troubleshooting

All Tests Failing Latency

Issue: Every test exceeds latency limit

Solutions:

  1. Check that the limit is realistic:

    max_latency_seconds: 0.1  # Too strict!
    max_latency_seconds: 3.0  # Better

  2. Use a faster model:

    model: gemini-2.0-flash-exp

  3. Reduce response length:

    max_tokens: 256

  4. Check the network:

    • Slow connection?
    • High latency to the API?
    • Try from a different network

Inconsistent Latency

Issue: The same test varies widely in latency

Causes:

  • Network variability
  • API server load
  • Time of day effects

Solutions:

  • Run multiple times and average
  • Use percentile thresholds (p95, p99)
  • Test at different times

Mock Provider Too Fast

Issue: Mock provider passes all latency tests

Expected: The mock provider responds almost instantly (~1ms), so this is normal.

Solution: Only use the latency evaluator with real providers.

API Reference

For implementation details, see the LatencyEvaluator API Reference.