Trajectory Evaluator

Validate conversation flow and tool usage patterns in multi-turn dialogues.

Overview

The Trajectory Evaluator checks whether the agent follows the expected conversation path, including:

Correct number of conversation turns
Expected tool usage sequences
Proper intermediate steps
Structured dialogue flow

Key Features:

Multi-turn conversation validation
Tool use sequence matching
Exact and partial matching modes
Intermediate step verification

Configuration

Basic Configuration

evaluators:
  - type: trajectory_evaluator

Full Configuration

evaluators:
  - type: trajectory_evaluator
    enabled: true
    config:
      sequence_match_type: exact    # exact or partial
      allow_partial_match: false    # Allow partial tool matches

Matching Modes

Exact Match (Default)

Requires tool sequences to match exactly:

config:
  sequence_match_type: exact

Pass criteria:

Same number of tools
Same tool names
Same order

Example:

Expected: [search, calculate, respond]
Actual:   [search, calculate, respond]  ✓ PASS

Actual:   [search, respond]              ✗ FAIL (missing tool)
Actual:   [calculate, search, respond]  ✗ FAIL (wrong order)
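
The three pass criteria reduce to a single list comparison. A minimal Python sketch, assuming each tool call has already been reduced to its name; the function name `exact_match` is illustrative and not part of the evaluator's API:

```python
def exact_match(expected: list[str], actual: list[str]) -> bool:
    """Pass only when both lists hold the same tool names in the same order."""
    # List equality covers all three criteria at once: count, names, and order.
    return expected == actual


expected = ["search", "calculate", "respond"]
print(exact_match(expected, ["search", "calculate", "respond"]))  # True
print(exact_match(expected, ["search", "respond"]))               # False (missing tool)
print(exact_match(expected, ["calculate", "search", "respond"]))  # False (wrong order)
```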

Partial Match

Passes when the actual tool calls overlap with the expected ones, regardless of order:

config:
  sequence_match_type: partial
  allow_partial_match: true

Pass criteria:

Some tool overlap required
Order doesn't matter
Extra tools allowed

Example:

Expected: [search, calculate]
Actual:   [search, calculate, respond]  ✓ PASS (overlap exists)

Actual:   [respond, summarize]          ✗ FAIL (no overlap)
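
The overlap check can be sketched with a set intersection. This is a rough illustration, not the evaluator's actual code: the page does not say how much overlap is required, so the sketch assumes that one shared tool name is enough, and `partial_match` is a hypothetical name.

```python
def partial_match(expected: list[str], actual: list[str]) -> bool:
    """Pass when at least one expected tool also appears among the actual calls."""
    # Sets ignore order and duplicates, so reordering and extra tools do not matter.
    # Assumption: a single shared tool name counts as sufficient overlap.
    return bool(set(expected) & set(actual))


expected = ["search", "calculate"]
print(partial_match(expected, ["search", "calculate", "respond"]))  # True
print(partial_match(expected, ["respond", "summarize"]))            # False
```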

Usage Examples

Example 1: Validate Customer Support Flow

# config.yaml
dataset:
  loader: local_file
  paths: [./support_flow.json]

providers:
  - type: gemini
    agent_id: support_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: trajectory_evaluator
    config:
      sequence_match_type: exact

reporters:
  - type: console

Evalset:

{
  "eval_id": "support_001",
  "conversation": [
    {
      "invocation_id": "inv-1",
      "user_content": {"parts": [{"text": "I need help"}]},
      "intermediate_data": {
        "tool_uses": [
          {"name": "search_kb", "input": {"query": "help"}}
        ]
      },
      "final_response": {"parts": [{"text": "How can I help?"}]}
    },
    {
      "invocation_id": "inv-2",
      "user_content": {"parts": [{"text": "My order is late"}]},
      "intermediate_data": {
        "tool_uses": [
          {"name": "lookup_order", "input": {"order_id": "123"}},
          {"name": "check_shipping", "input": {"tracking": "ABC"}}
        ]
      },
      "final_response": {"parts": [{"text": "Your order will arrive tomorrow"}]}
    }
  ]
}
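
To see which tool sequence the evalset above defines for each turn, the tool names can be pulled out per invocation. The sketch below is only an illustration of the data layout; `tool_names_by_invocation` is a hypothetical helper, and the inline `case` is a trimmed copy of the evalset (user content and final responses omitted).

```python
def tool_names_by_invocation(case: dict) -> list[list[str]]:
    """Return one list of tool names per invocation, in conversation order."""
    names = []
    for invocation in case.get("conversation", []):
        # A turn without intermediate_data is treated as using no tools.
        tool_uses = invocation.get("intermediate_data", {}).get("tool_uses", [])
        names.append([tool["name"] for tool in tool_uses])
    return names


case = {
    "eval_id": "support_001",
    "conversation": [
        {"invocation_id": "inv-1",
         "intermediate_data": {"tool_uses": [
             {"name": "search_kb", "input": {"query": "help"}},
         ]}},
        {"invocation_id": "inv-2",
         "intermediate_data": {"tool_uses": [
             {"name": "lookup_order", "input": {"order_id": "123"}},
             {"name": "check_shipping", "input": {"tracking": "ABC"}},
         ]}},
    ],
}
print(tool_names_by_invocation(case))
# [['search_kb'], ['lookup_order', 'check_shipping']]
```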

Example 2: Lenient Tool Matching

evaluators:
  - type: trajectory_evaluator
    config:
      sequence_match_type: partial
      allow_partial_match: true

Good for exploratory agents that may use different tool combinations.

Example 3: Per-Case Override

{
  "eval_id": "strict_flow_001",
  "conversation": [...],
  "evaluator_config": {
    "TrajectoryEvaluator": {
      "sequence_match_type": "exact"
    }
  }
},
{
  "eval_id": "flexible_flow_001",
  "conversation": [...],
  "evaluator_config": {
    "TrajectoryEvaluator": {
      "sequence_match_type": "partial"
    }
  }
}
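
One way to picture a per-case override is as a dictionary merge in which the case's settings win over the global evaluator config. The merge semantics here are an assumption, since this page does not specify how overrides are combined, and `effective_config` is a hypothetical helper:

```python
def effective_config(global_config: dict, case: dict,
                     evaluator_name: str = "TrajectoryEvaluator") -> dict:
    """Merge a case's evaluator_config entry over the global config."""
    override = case.get("evaluator_config", {}).get(evaluator_name, {})
    # Assumption: keys set on the case replace global keys; the rest are inherited.
    return {**global_config, **override}


global_config = {"sequence_match_type": "exact", "allow_partial_match": False}
flexible_case = {
    "eval_id": "flexible_flow_001",
    "evaluator_config": {"TrajectoryEvaluator": {"sequence_match_type": "partial"}},
}
print(effective_config(global_config, flexible_case))
# {'sequence_match_type': 'partial', 'allow_partial_match': False}
```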

Evaluation Result

The trajectory evaluator returns detailed results:

{
    "evaluator_name": "TrajectoryEvaluator",
    "evaluator_type": "trajectory_evaluator",
    "passed": True,
    "score": 1.0,
    "success": True,
    "details": {
        "sequence_match_type": "exact",
        "allow_partial_match": False,
        "match_rate": 1.0,
        "tool_matches": [
            {
                "invocation": 0,
                "expected_tool_count": 1,
                "actual_tool_count": 1,
                "match": True,
                "expected_tools": ["search_kb"],
                "actual_tools": ["search_kb"]
            },
            {
                "invocation": 1,
                "expected_tool_count": 2,
                "actual_tool_count": 2,
                "match": True,
                "expected_tools": ["lookup_order", "check_shipping"],
                "actual_tools": ["lookup_order", "check_shipping"]
            }
        ]
    }
}
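
When a case fails, the `tool_matches` entries show which invocation diverged. The sketch below walks a result of the shape shown above and reports the mismatched turns. `describe_mismatches` is a hypothetical helper, and the failing `result` is made-up sample data rather than real evaluator output.

```python
def describe_mismatches(result: dict) -> list[str]:
    """Return one readable line per invocation whose tools did not match."""
    lines = []
    for entry in result.get("details", {}).get("tool_matches", []):
        if not entry.get("match"):
            lines.append(
                f"invocation {entry['invocation']}: "
                f"expected {entry['expected_tools']}, got {entry['actual_tools']}"
            )
    return lines


# Hypothetical failing result, trimmed to the fields the helper reads.
result = {
    "passed": False,
    "details": {
        "tool_matches": [
            {"invocation": 0, "match": True,
             "expected_tools": ["search_kb"], "actual_tools": ["search_kb"]},
            {"invocation": 1, "match": False,
             "expected_tools": ["lookup_order", "check_shipping"],
             "actual_tools": ["lookup_order"]},
        ]
    },
}
for line in describe_mismatches(result):
    print(line)
# invocation 1: expected ['lookup_order', 'check_shipping'], got ['lookup_order']
```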

When to Use

Use Trajectory Evaluator When:

Testing multi-step workflows
Validating agent planning
Checking tool usage patterns
Ensuring consistent execution paths
Testing dialogue systems

Don't Use When:

Conversations are single-turn (no trajectory to validate)
Tool order doesn't matter
Only final response quality matters
Agents should be creative or exploratory

Best Practices

1. Define Clear Expected Paths

Be explicit about expected tool sequences:

{
  "intermediate_data": {
    "tool_uses": [
      {"name": "search", "input": {...}},
      {"name": "filter", "input": {...}},
      {"name": "respond", "input": {...}}
    ]
  }
}
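
A quick structural check can catch malformed expected paths before an evaluation run. This is a sketch under assumptions: it requires every tool use to carry a string `name` and an object `input`, which matches the examples on this page but is not a documented schema, and `tool_use_problems` is a hypothetical helper.

```python
def tool_use_problems(invocation: dict) -> list[str]:
    """List structural problems in one invocation's intermediate_data.tool_uses."""
    tool_uses = (invocation.get("intermediate_data") or {}).get("tool_uses")
    if not isinstance(tool_uses, list):
        return ["intermediate_data.tool_uses is missing or not a list"]
    problems = []
    for index, tool in enumerate(tool_uses):
        if not isinstance(tool, dict):
            problems.append(f"tool_uses[{index}] is not an object")
            continue
        if not isinstance(tool.get("name"), str) or not tool["name"]:
            problems.append(f"tool_uses[{index}] has no non-empty string 'name'")
        # Assumption: 'input' is an object, as in every example on this page.
        if not isinstance(tool.get("input"), dict):
            problems.append(f"tool_uses[{index}] has no object 'input'")
    return problems


good = {"intermediate_data": {"tool_uses": [{"name": "search", "input": {"q": "x"}}]}}
bad = {"intermediate_data": {"tool_uses": [{"input": {}}, "filter"]}}
print(tool_use_problems(good))  # []
print(tool_use_problems(bad))
```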

2. Use Exact Match for Critical Paths

For safety-critical or compliance workflows:

config:
  sequence_match_type: exact

3. Use Partial Match for Exploration

For research/creative tasks:

config:
  sequence_match_type: partial

4. Combine with Response Evaluator

Validate both trajectory and response quality:

evaluators:
  - type: trajectory_evaluator
  - type: response_evaluator

Troubleshooting

All Trajectories Failing

Issue: All test cases fail trajectory check

Solutions:

Check tool names match exactly:

Expected: "search_database"
Actual:   "searchDatabase"  ✗ (case-sensitive)

Use partial matching if appropriate:
```
sequence_match_type: partial
```
Review the intermediate_data structure: ensure the tool_uses array is properly formatted.
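
Naming-style mismatches such as `search_database` versus `searchDatabase` are easy to miss by eye. The following sketch flags pairs of names that differ as written but agree once converted to lowercase snake_case. Both helpers are hypothetical diagnostics for use outside the evaluator and do not change how it compares names.

```python
import re


def normalize(name: str) -> str:
    """Convert camelCase or kebab-case to lowercase snake_case."""
    # Insert an underscore at each lowercase-or-digit to uppercase boundary.
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return snake.replace("-", "_").lower()


def naming_mismatches(expected: list[str], actual: list[str]) -> list[tuple[str, str]]:
    """Return (expected, actual) pairs that differ only in naming style."""
    return [
        (e, a)
        for e in expected
        for a in actual
        if e != a and normalize(e) == normalize(a)
    ]


print(naming_mismatches(["search_database"], ["searchDatabase"]))
# [('search_database', 'searchDatabase')]
```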

Conversation Length Mismatch

Error: conversation_length mismatch

Cause: The expected and actual conversations have a different number of turns.

Solution: Ensure the expected and actual conversations have the same number of invocations.
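
A pre-check along these lines can surface the mismatch before a run. It is a minimal sketch that assumes both conversations are available as lists of invocations; `turn_count_problem` is a hypothetical helper.

```python
from typing import Optional


def turn_count_problem(expected: list, actual: list) -> Optional[str]:
    """Return a message when the invocation counts differ, otherwise None."""
    if len(expected) == len(actual):
        return None
    return f"expected {len(expected)} invocations, got {len(actual)}"


print(turn_count_problem(["inv-1", "inv-2"], ["inv-1"]))
# expected 2 invocations, got 1
```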

API Reference

For implementation details, see the TrajectoryEvaluator API Reference.
