
Telemetry & Observability (OpenTelemetry)

Judge LLM includes optional OpenTelemetry (OTEL) instrumentation for deep observability into evaluation runs. Telemetry is disabled by default and adds zero overhead when not enabled.

Overview

When enabled, telemetry provides:

  • Distributed tracing across the full evaluation lifecycle
  • Span-level detail for every provider call, evaluator run, and report generation
  • Error tracking with retry attempts, HTTP status codes, and failure reasons
  • Performance metrics including latency, token usage, and cost per span
  • Integration with observability platforms like Arize Phoenix, Jaeger, Grafana Tempo, and Datadog

Installation

Telemetry dependencies are optional. Install only what you need:

# Console + OTLP exporters (Jaeger, Grafana Tempo, Datadog, etc.)
pip install judge-llm[telemetry]

# Arize Phoenix (LLM-focused observability)
pip install judge-llm[phoenix]

Enabling Telemetry

Choose any of these methods:

Method 1: CLI Flag

# Console exporter (prints spans to stdout)
judge-llm run --config config.yaml --telemetry

# OTLP exporter (sends to Jaeger, Grafana Tempo, etc.)
judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp

# Arize Phoenix
judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix

Method 2: Environment Variable

export JUDGE_LLM_TELEMETRY=true
export OTEL_EXPORTER_TYPE=console # or "otlp" or "phoenix"

judge-llm run --config config.yaml
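The same switch can be flipped from Python before invoking the CLI or the API. A minimal sketch, assuming the truthy values listed in the Environment Variables Reference (true, 1, yes); the `telemetry_enabled` helper is illustrative, not part of judge-llm:

```python
import os

# Equivalent to the exports above, set programmatically.
os.environ["JUDGE_LLM_TELEMETRY"] = "true"
os.environ["OTEL_EXPORTER_TYPE"] = "console"  # or "otlp" / "phoenix"

def telemetry_enabled() -> bool:
    # Illustrative parse: the accepted truthy values ("true", "1", "yes")
    # come from the Environment Variables Reference table in this page.
    return os.environ.get("JUDGE_LLM_TELEMETRY", "").strip().lower() in {"true", "1", "yes"}

print(telemetry_enabled())  # True with the value set above
```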

Method 3: YAML Configuration

agent:
  log_level: INFO
  telemetry:
    enabled: true
    exporter: console                # "console", "otlp", or "phoenix"
    service_name: judge-llm          # optional, default: "judge-llm"
    endpoint: http://localhost:4317  # optional, for otlp/phoenix

Method 4: Python API

from judge_llm import evaluate
from judge_llm.utils.telemetry import init_telemetry

# Initialize before calling evaluate
init_telemetry(exporter="phoenix", service_name="my-eval-pipeline")

report = evaluate(config="config.yaml")

Exporters

Console Exporter

Prints spans to stdout. Useful for debugging and development.

judge-llm run --config config.yaml --telemetry --telemetry-exporter console

No additional setup required.

OTLP Exporter

Sends spans to any OpenTelemetry-compatible backend via gRPC or HTTP.

# Set the OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp

Or via YAML:

agent:
  telemetry:
    enabled: true
    exporter: otlp
    endpoint: http://localhost:4317

Compatible backends include Jaeger, Grafana Tempo, Datadog, and any other OTLP-capable collector.

Arize Phoenix

Arize Phoenix is an open-source observability platform built for LLM applications.

Setup Phoenix

# Install Phoenix
pip install arize-phoenix

# Start the Phoenix server
phoenix serve

Phoenix will be available at http://localhost:6006.

Run with Phoenix

judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix

Or via YAML:

agent:
  telemetry:
    enabled: true
    exporter: phoenix
    endpoint: http://localhost:6006  # default
    service_name: my-eval-project    # shows as project name in Phoenix

Or via environment variables:

export JUDGE_LLM_TELEMETRY=true
export OTEL_EXPORTER_TYPE=phoenix
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006

judge-llm run --config config.yaml

After running, open http://localhost:6006 to see your traces.

What Phoenix Shows

With OpenInference semantic conventions (automatically included in judge-llm[phoenix]), Phoenix displays:

  • Sessions — evaluation runs grouped by session ID, showing multi-turn conversation flow
  • Input/Output — actual user messages and agent response text on each span
  • LLM calls — model name, token counts (prompt/completion/total), and cost
  • HTTP details — full request/response payloads and headers on ADK HTTP spans
  • Evaluator results — pass/fail status, scores, and details
  • Span classification — spans categorized as CHAIN, LLM, TOOL, or EVALUATOR

Span Hierarchy

When telemetry is enabled, Judge LLM creates the following span tree for each evaluation:

judge_llm.evaluate                                   [CHAIN]
├── judge_llm.execute_task                           [CHAIN, session.id, input/output]
│   ├── judge_llm.provider.execute                   [LLM, input/output, tokens, model]
│   │   ├── judge_llm.adk_http.create_session        [TOOL, HTTP req/res body]
│   │   └── judge_llm.adk_http.send_and_collect      [LLM, HTTP req/res body, tokens]
│   └── judge_llm.evaluator.evaluate                 [EVALUATOR, score, output]
└── judge_llm.reporter.generate                      [per reporter]
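The nesting above can be reproduced with a small stdlib-only sketch. The `Tracer` class here is a toy model for illustration, not judge-llm's implementation; the real spans are emitted through OpenTelemetry:

```python
from contextlib import contextmanager

class Tracer:
    """Toy tracer that records (name, depth) in start order."""
    def __init__(self):
        self.started = []
        self._depth = 0

    @contextmanager
    def span(self, name):
        self.started.append((name, self._depth))
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1

tracer = Tracer()
with tracer.span("judge_llm.evaluate"):
    with tracer.span("judge_llm.execute_task"):
        with tracer.span("judge_llm.provider.execute"):
            with tracer.span("judge_llm.adk_http.create_session"):
                pass
            with tracer.span("judge_llm.adk_http.send_and_collect"):
                pass
        with tracer.span("judge_llm.evaluator.evaluate"):
            pass
    with tracer.span("judge_llm.reporter.generate"):
        pass

# Print the tree in start order, indented by depth.
for name, depth in tracer.started:
    print("    " * depth + name)
```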

Span Attributes

Root Span: judge_llm.evaluate

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.num_providers | int | Number of providers configured |
| judge_llm.num_evaluators | int | Number of evaluators configured |
| judge_llm.num_eval_sets | int | Number of evaluation sets loaded |
| judge_llm.num_runs | int | Runs per eval case |
| judge_llm.parallel | bool | Whether parallel execution is enabled |
| judge_llm.total_executions | int | Total execution runs completed |
| judge_llm.success_rate | float | Overall success rate (0.0-1.0) |
| judge_llm.total_cost | float | Total cost across all executions |
| judge_llm.total_time | float | Total wall-clock time (seconds) |

Task Span: judge_llm.execute_task

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.eval_case_id | str | Evaluation case identifier |
| judge_llm.eval_set_id | str | Evaluation set identifier |
| judge_llm.provider_type | str | Provider type (e.g., "gemini", "adk_http") |
| judge_llm.run_number | int | Run number (1-based) |
| judge_llm.task.success | bool | Whether the task passed all evaluators |
| openinference.span.kind | str | CHAIN |
| session.id | str | Session ID for Phoenix grouping |
| input.value | str | User message(s) from the eval case |
| output.value | str | Agent response text |

Provider Span: judge_llm.provider.execute

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.provider_type | str | Provider type |
| judge_llm.agent_id | str | Agent identifier |
| judge_llm.provider.success | bool | Provider execution success |
| judge_llm.provider.cost | float | Execution cost |
| judge_llm.provider.token_usage.total | int | Total tokens used |
| openinference.span.kind | str | LLM |
| session.id | str | Session ID for Phoenix grouping |
| input.value | str | User input text |
| output.value | str | Agent response text |
| llm.model_name | str | Model name |
| llm.token_count.prompt | int | Prompt token count |
| llm.token_count.completion | int | Completion token count |
| llm.token_count.total | int | Total token count |

Events recorded on failure:

| Event | Attributes | Description |
| --- | --- | --- |
| provider_error | error | Provider execution error message |

ADK HTTP Spans

judge_llm.adk_http.create_session

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.adk_http.endpoint | str | Base endpoint URL |
| judge_llm.adk_http.app_name | str | Application name |
| judge_llm.adk_http.user_id | str | User ID |
| http.status_code | int | HTTP response status code |
| judge_llm.adk_http.session_id | str | Created session ID |
| http.request.method | str | POST |
| http.request.url | str | Full request URL |
| http.request.headers | str | Request headers |
| http.request.body | str | Request payload |
| http.response.body | str | Response body (truncated to 2KB) |
| openinference.span.kind | str | TOOL |
| input.value | str | POST {url} |
| output.value | str | Session ID and status |
| session.id | str | Created session ID |
| user.id | str | User ID |

judge_llm.adk_http.send_and_collect

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.adk_http.endpoint | str | Request endpoint URL |
| judge_llm.adk_http.session_id | str | Session ID |
| http.status_code | int | HTTP response status code |
| http.request.method | str | POST |
| http.request.url | str | Full request URL |
| http.request.headers | str | Request headers (auth excluded) |
| http.request.body | str | Request payload (truncated to 4KB, state excluded) |
| http.response.status_code | int | Response status code |
| http.response.headers | str | Response headers |
| http.response.body | str | Response body (truncated to 4KB) |
| judge_llm.adk_http.event_count | int | Number of SSE events received |
| judge_llm.adk_http.content_type | str | Response content type |
| judge_llm.adk_http.attempts | int | Number of attempts (1 = no retries) |
| openinference.span.kind | str | LLM |
| session.id | str | Session ID |
| input.value | str | User message text |
| output.value | str | Agent response text extracted from events |
| llm.model_name | str | Model name |
| llm.token_count.prompt | int | Prompt token count |
| llm.token_count.completion | int | Completion token count |
| llm.token_count.total | int | Total token count |

Events recorded on retries:

| Event | Attributes | Description |
| --- | --- | --- |
| http_error | attempt, status_code, error | HTTP status error |
| request_error | attempt, error | Connection/timeout error |
| unexpected_error | attempt, error | Other errors |
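These events can be modeled with a generic retry loop that records one event per failed attempt. A stdlib-only sketch: the event names match the table above, but the `send_with_retries` helper itself is hypothetical, not judge-llm's code:

```python
def send_with_retries(request, max_attempts=3):
    """Call request() up to max_attempts times, recording one event per failure."""
    events = []
    for attempt in range(1, max_attempts + 1):
        try:
            return request(), events
        except ConnectionError as exc:
            # Maps to the "request_error" event (connection/timeout).
            events.append({"name": "request_error", "attempt": attempt, "error": str(exc)})
        except Exception as exc:
            # Maps to the "unexpected_error" event (anything else).
            events.append({"name": "unexpected_error", "attempt": attempt, "error": str(exc)})
    raise RuntimeError(f"failed after {max_attempts} attempts")

# A request that fails once with a connection error, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("timeout")
    return "ok"

result, events = send_with_retries(flaky)
print(result, len(events))  # ok 1
```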

Evaluator Span: judge_llm.evaluator.evaluate

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.evaluator.name | str | Evaluator name |
| judge_llm.evaluator.passed | bool | Whether the evaluator passed |
| judge_llm.evaluator.score | float | Evaluator score (-1 if N/A) |
| openinference.span.kind | str | EVALUATOR |
| output.value | str | Evaluator result summary (passed, score, details) |

Reporter Span: judge_llm.reporter.generate

| Attribute | Type | Description |
| --- | --- | --- |
| judge_llm.reporter.type | str | Reporter class name |

Environment Variables Reference

| Variable | Description | Default |
| --- | --- | --- |
| JUDGE_LLM_TELEMETRY | Enable telemetry (true, 1, yes) | false |
| OTEL_EXPORTER_TYPE | Exporter type (console, otlp, phoenix) | console |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | http://localhost:4317 |
| PHOENIX_COLLECTOR_ENDPOINT | Phoenix collector endpoint | http://localhost:6006 |
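The resolution order implied by the table (environment value first, then the documented default) can be mirrored in a small lookup helper. The variable names and defaults are from the table; the `resolve` function itself is an illustrative sketch:

```python
import os

# Documented defaults from the reference table above.
DEFAULTS = {
    "JUDGE_LLM_TELEMETRY": "false",
    "OTEL_EXPORTER_TYPE": "console",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "PHOENIX_COLLECTOR_ENDPOINT": "http://localhost:6006",
}

def resolve(name: str) -> str:
    """Environment value wins; otherwise fall back to the documented default."""
    return os.environ.get(name, DEFAULTS[name])

print(resolve("OTEL_EXPORTER_OTLP_ENDPOINT"))  # the default unless overridden
```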

Examples

Debug a Failing Provider

Enable telemetry to see exactly where a provider call fails:

JUDGE_LLM_TELEMETRY=true judge-llm run --config config.yaml -l DEBUG

The console exporter will show spans with error events, retry attempts, HTTP status codes, and timing for each step.

Monitor Evaluations in Phoenix

# Terminal 1: Start Phoenix
pip install arize-phoenix
phoenix serve

# Terminal 2: Run evaluation
pip install judge-llm[phoenix]
judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix

Open http://localhost:6006 to see:

  • Sessions grouping related spans by evaluation case
  • Input/Output showing user messages and agent responses on each span
  • LLM calls with model name, token counts, and cost
  • HTTP payloads with full request/response bodies and headers
  • Evaluator results with scores, pass/fail, and details
  • Full trace waterfall with timing for each step
  • Error details with retry history

Send Traces to Jaeger

# Start Jaeger (Docker)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

# Run with OTLP exporter
pip install judge-llm[telemetry]
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp

Open http://localhost:16686 to view traces in Jaeger UI.

CI/CD with Telemetry

# GitHub Actions
- name: Run evaluation with telemetry
  env:
    JUDGE_LLM_TELEMETRY: true
    OTEL_EXPORTER_TYPE: otlp
    OTEL_EXPORTER_OTLP_ENDPOINT: ${{ secrets.OTEL_ENDPOINT }}
  run: |
    pip install judge-llm[telemetry]
    judge-llm run --config config.yaml

Behavior When Disabled

When telemetry is not enabled (the default):

  • No dependencies required - opentelemetry packages are not imported
  • Zero overhead - all tracing calls are no-ops that return immediately
  • No side effects - no spans created, no data sent anywhere
  • Safe to leave in code - instrumentation points are always present but inactive
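The no-op behavior described above can be pictured as a pass-through context manager. This is a sketch of the pattern, assuming (not quoting) judge-llm's internals:

```python
from contextlib import contextmanager

_ENABLED = False  # telemetry off: the default

@contextmanager
def span(name: str, **attributes):
    # Disabled path: yield immediately, create nothing, send nothing.
    if _ENABLED:
        pass  # real OpenTelemetry span setup would go here (assumption)
    yield None

def instrumented_work() -> int:
    # Instrumentation points stay in the code but are inert when disabled.
    with span("judge_llm.evaluate", num_runs=1):
        return 42

print(instrumented_work())  # 42
```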