Cost Evaluator

Control evaluation budgets by setting maximum cost thresholds per test case.

Overview

The Cost Evaluator ensures your evaluation costs stay within budget by:

  • Tracking actual API costs per test case
  • Comparing against configured maximums
  • Failing tests that exceed cost limits
  • Monitoring cost efficiency

Key Features:

  • Per-test-case cost limits
  • Automatic cost tracking from providers
  • Currency support
  • Cost ratio analysis
  • Budget enforcement

Configuration

Basic Configuration

evaluators:
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.10  # $0.10 per test case

Full Configuration

evaluators:
  - type: cost_evaluator
    enabled: true
    config:
      max_cost_per_case: 0.10  # Maximum cost in USD
      currency: USD            # Currency code

Cost Tracking

How Costs are Calculated

Providers automatically track costs:

  1. Token counting: Count input/output tokens
  2. Pricing lookup: Apply provider-specific rates
  3. Cost calculation: tokens × rate
  4. Total aggregation: Sum across all turns

Example (Gemini):

Input: 100 tokens × $0.00025/1K = $0.000025
Output: 50 tokens × $0.0005/1K = $0.000025
Total: $0.00005
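The four steps above can be sketched in Python. The rate table and helper functions below are illustrative only, not part of the judge-llm API:

```python
# Illustrative per-1K-token rates (USD), matching the pricing table below
RATES = {
    "gemini-flash": {"input": 0.00025, "output": 0.0005},
    "gemini-pro": {"input": 0.00125, "output": 0.005},
}

def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Steps 1-3: count tokens, look up the rate, compute tokens x rate."""
    rate = RATES[model]
    return (input_tokens / 1000) * rate["input"] + (output_tokens / 1000) * rate["output"]

def total_cost(model: str, turns: list[tuple[int, int]]) -> float:
    """Step 4: sum per-turn costs across the whole conversation."""
    return sum(turn_cost(model, i, o) for i, o in turns)

# Reproduces the worked example: 100 input + 50 output tokens on Flash
print(total_cost("gemini-flash", [(100, 50)]))  # 5e-05
```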

Provider Costs

| Provider     | Input (per 1K tokens) | Output (per 1K tokens) |
|--------------|-----------------------|------------------------|
| Gemini Flash | $0.00025              | $0.0005                |
| Gemini Pro   | $0.00125              | $0.005                 |
| Mock         | $0.00                 | $0.00                  |
| Custom       | Varies                | Varies                 |

Setting Cost Limits

By Test Complexity

# Simple Q&A
max_cost_per_case: 0.001 # $0.001

# Multi-turn conversations
max_cost_per_case: 0.01 # $0.01

# Complex reasoning
max_cost_per_case: 0.10 # $0.10

# Large document processing
max_cost_per_case: 1.00 # $1.00

By Budget

# Total budget: $10 for 100 test cases
max_cost_per_case: 0.10 # $10 / 100 = $0.10 per case
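The division above generalizes to a one-line helper (a trivial sketch, not part of the tool):

```python
def per_case_limit(total_budget: float, num_cases: int) -> float:
    """Split a total evaluation budget evenly across test cases."""
    return total_budget / num_cases

print(per_case_limit(10.00, 100))  # 0.1
```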

Usage Examples

Example 1: Basic Cost Control

# config.yaml
dataset:
  loader: local_file
  paths: [./tests.json]

providers:
  - type: gemini
    agent_id: cost_aware_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.01

reporters:
  - type: console

Run evaluation:

judge-llm run --config config.yaml

Example 2: Different Limits for Different Tests

{
  "eval_id": "simple_qa_001",
  "conversation": [...],
  "evaluator_config": {
    "CostEvaluator": {
      "max_cost_per_case": 0.001  // Simple = cheap
    }
  }
},
{
  "eval_id": "complex_reasoning_001",
  "conversation": [...],
  "evaluator_config": {
    "CostEvaluator": {
      "max_cost_per_case": 0.10  // Complex = higher budget
    }
  }
}

Example 3: Cost Comparison

Compare costs across providers:

providers:
  - type: gemini
    agent_id: gemini_flash
    model: gemini-2.0-flash-exp

  - type: gemini
    agent_id: gemini_pro
    model: gemini-1.5-pro

evaluators:
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05
  - type: response_evaluator

reporters:
  - type: html
    config:
      output_file: cost_comparison.html

Open the generated HTML report to compare costs across the two providers.

Example 4: Programmatic Cost Monitoring

from judge_llm import evaluate

report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{
        "type": "gemini",
        "agent_id": "test",
        "model": "gemini-2.0-flash-exp"
    }],
    evaluators=[{
        "type": "cost_evaluator",
        "config": {"max_cost_per_case": 0.01}
    }],
    reporters=[{"type": "console"}]
)

# Check total costs
print(f"Total cost: ${report.total_cost:.4f}")
print(f"Avg cost per case: ${report.total_cost / len(report.test_cases):.4f}")

# Find expensive test cases
for case in report.test_cases:
    if case.cost > 0.01:
        print(f"{case.eval_id}: ${case.cost:.4f} (over budget!)")

Evaluation Result

The cost evaluator returns:

{
  "evaluator_name": "CostEvaluator",
  "evaluator_type": "cost_evaluator",
  "passed": True,
  "score": 1.0,
  "threshold": 0.10,
  "success": True,
  "details": {
    "actual_cost": 0.000234,
    "max_cost": 0.10,
    "currency": "USD",
    "cost_ratio": 0.00234  # actual / max
  }
}

Pass criteria: actual_cost ≤ max_cost_per_case
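The pass criterion and cost_ratio field can be expressed as a small sketch. Field names follow the result shown above; the function itself is illustrative, not the evaluator's actual implementation:

```python
def check_cost(actual_cost: float, max_cost: float, currency: str = "USD") -> dict:
    """Pass if actual_cost <= max_cost; report the actual/max ratio."""
    passed = actual_cost <= max_cost
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "details": {
            "actual_cost": actual_cost,
            "max_cost": max_cost,
            "currency": currency,
            "cost_ratio": actual_cost / max_cost,
        },
    }

print(check_cost(0.000234, 0.10)["passed"])  # True
print(check_cost(0.15, 0.10)["passed"])      # False (ratio 1.5 = 50% over)
```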

Failed Example

{
  "passed": False,
  "score": 0.0,
  "details": {
    "actual_cost": 0.15,
    "max_cost": 0.10,
    "currency": "USD",
    "cost_ratio": 1.5  # 50% over budget
  }
}

Cost Optimization Tips

1. Use Cheaper Models

# Expensive
model: gemini-1.5-pro

# Cheaper (about 5x less on input, 10x less on output)
model: gemini-2.0-flash-exp

2. Limit Response Length

providers:
  - type: gemini
    model: gemini-2.0-flash-exp
    max_tokens: 512  # Limit output tokens

3. Reduce Temperature

Lower temperature tends to produce more focused, shorter responses, which means fewer output tokens:

temperature: 0.3  # More focused, cheaper

4. Use Mock for Development

# Development - Free
providers:
  - type: mock

# Production - Real costs
providers:
  - type: gemini

5. Batch Similar Tests

Group similar tests to benefit from caching (if provider supports it).

Best Practices

1. Set Realistic Limits

Start generous, then tighten:

# Week 1: Learn costs
max_cost_per_case: 1.00

# Week 2: Optimize based on data
max_cost_per_case: 0.10

# Week 3: Production limits
max_cost_per_case: 0.05

2. Monitor Cost Trends

Track costs over time:

  • Average cost per test case
  • Total evaluation costs
  • Cost by test case type
  • Cost by provider
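One way to compute these metrics is a small aggregation over per-case results. The record shape below is an assumption for illustration, not the actual report schema:

```python
from collections import defaultdict

# Hypothetical per-case records: (eval_id, provider, cost)
results = [
    ("simple_qa_001", "gemini_flash", 0.0004),
    ("simple_qa_002", "gemini_flash", 0.0006),
    ("complex_001", "gemini_pro", 0.0300),
]

# Total and average cost per test case
total = sum(cost for _, _, cost in results)

# Cost broken down by provider
by_provider = defaultdict(float)
for _, provider, cost in results:
    by_provider[provider] += cost

print(f"Total: ${total:.4f}, avg per case: ${total / len(results):.4f}")
print(dict(by_provider))
```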

3. Use with Response Evaluator

Balance cost and quality:

evaluators:
  - type: cost_evaluator
    config: {max_cost_per_case: 0.01}
  - type: response_evaluator
    config: {similarity_threshold: 0.8}

Find the sweet spot: good quality at reasonable cost.

4. Budget Alerts

Set up alerts for cost overruns:

if report.total_cost > MONTHLY_BUDGET:
    send_alert(f"Budget exceeded: ${report.total_cost}")

Troubleshooting

All Tests Failing Cost Check

Issue: Every test exceeds cost limit

Solutions:

  1. Check limit is reasonable:

    max_cost_per_case: 0.000001  # Too low!
    max_cost_per_case: 0.01 # Better
  2. Use cheaper model:

    model: gemini-2.0-flash-exp  # Cheapest
  3. Reduce max_tokens:

    max_tokens: 512  # Limit response length

Cost is Always $0.00

Issue: Provider not tracking costs

Solutions:

  1. Check provider type:

    • Mock provider: Always $0 (expected)
    • Gemini provider: Should track costs
  2. Verify provider implementation: Custom providers must calculate costs in execute()
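A custom provider's execute() would typically compute a cost from its own token counts before returning a result. A minimal sketch, assuming your own rate constants (nothing below is a judge-llm API):

```python
# Hypothetical rates for a custom backend (USD per 1K tokens)
INPUT_RATE = 0.0002
OUTPUT_RATE = 0.0004

def compute_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost a custom provider should attach to each result it returns.

    If execute() never sets a cost, the evaluator sees $0.00 for every case.
    """
    return (input_tokens / 1000) * INPUT_RATE + (output_tokens / 1000) * OUTPUT_RATE

print(round(compute_cost(1000, 500), 6))  # 0.0004
```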

Unexpected High Costs

Issue: Costs higher than expected

Causes:

  • Long conversations (many turns)
  • Large input contexts
  • High temperature (verbose responses)
  • Wrong model selected

Debug:

# Check token usage
for case in report.test_cases:
    print(f"{case.eval_id}:")
    print(f"  Tokens: {case.token_usage.total_tokens}")
    print(f"  Cost: ${case.cost:.4f}")

API Reference

For implementation details, see the CostEvaluator API Reference.