OpenTelemetry is the de facto standard for distributed tracing in modern API architectures. It captures spans across service boundaries, measures latency, and lets you reconstruct the call chain for any request. But standard OpenTelemetry instrumentation treats an LLM call as an opaque HTTP request: it captures the latency and status code, and misses everything else that matters.
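To see why, consider what vanilla HTTP client auto-instrumentation records. A minimal sketch, assuming the `opentelemetry-instrumentation-httpx` package (the Anthropic Python SDK makes its requests through httpx):

```python
# Standard auto-instrumentation: every outbound httpx request, including calls
# to the LLM provider, becomes a generic HTTP client span.
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

HTTPXClientInstrumentor().instrument()

# From this point on, a POST to the provider's /v1/messages endpoint is recorded
# with HTTP-level attributes only (method, URL, status code) plus duration.
# Nothing about tokens, model version, prompt contents, or cost is captured.
```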
What Standard Traces Miss for AI Workloads
- Token counts (input and output): essential for cost attribution and budget monitoring
- Model version: the same endpoint may route to different model versions; which version handled this request?
- System prompt hash: did the system prompt change between this call and yesterday's call?
- Prompt classification: was this request flagged for any security patterns?
- Confidence scores and output schema validation results
- Cache hit/miss for semantic caching
Adding AI-Specific Attributes
```python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def call_llm(prompt: str, system: str) -> dict:
    # anthropic_client (an async Anthropic client) and calculate_cost (a pricing
    # helper keyed on the model's per-token rates) are assumed to be defined elsewhere.
    with tracer.start_as_current_span("llm.inference") as span:
        span.set_attribute("llm.model", "claude-3-5-sonnet-20241022")
        span.set_attribute("llm.system_prompt_hash", hashlib.sha256(system.encode()).hexdigest()[:16])
        # Rough pre-call estimate; exact counts from the API response are recorded below
        span.set_attribute("llm.input_tokens_estimate", len(prompt.split()) * 1.3)
        response = await anthropic_client.messages.create(...)
        span.set_attribute("llm.input_tokens", response.usage.input_tokens)
        span.set_attribute("llm.output_tokens", response.usage.output_tokens)
        span.set_attribute("llm.cost_usd", calculate_cost(response.usage))
        span.set_attribute("llm.stop_reason", response.stop_reason)
        return response
```

The Security Observability Layer
Security events need their own trace attributes: whether the request was scanned for injection patterns, which patterns matched (if any), whether the output passed schema validation, and whether any circuit breakers or rate limits were triggered. Standard application traces do not include this information — it must be added at the gateway layer.
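A minimal sketch of what that gateway-layer annotation can look like; the `llm.security.*` attribute names and the toy regex patterns below are illustrative assumptions, not an established semantic convention:

```python
import re

from opentelemetry import trace

# Toy patterns purely for illustration; a real gateway would use a proper
# injection-detection engine rather than two regexes.
INJECTION_PATTERNS = {
    "ignore_previous_instructions": re.compile(r"ignore (all )?previous instructions", re.I),
    "role_override": re.compile(r"you are now", re.I),
}

def annotate_security(span: trace.Span, prompt: str, schema_valid: bool, rate_limited: bool) -> None:
    matched = [name for name, pattern in INJECTION_PATTERNS.items() if pattern.search(prompt)]
    span.set_attribute("llm.security.injection_scanned", True)
    span.set_attribute("llm.security.patterns_matched", matched)
    span.set_attribute("llm.security.output_schema_valid", schema_valid)
    span.set_attribute("llm.security.rate_limited", rate_limited)
```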
G8KEPR attaches AI-specific span attributes to every proxied LLM call: model version, token counts, prompt hash, security scan results, and cost. These are exported via OTLP to your existing observability stack — Grafana, Datadog, Honeycomb, or any OTLP-compatible backend.
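If you are adding these attributes in your own services rather than through a gateway, the export side is the usual OTLP setup. A minimal sketch, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages and a placeholder collector endpoint:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Point the OTLP exporter at your collector; Grafana, Datadog, and Honeycomb
# all accept OTLP either directly or via the OpenTelemetry Collector.
provider = TracerProvider(resource=Resource.create({"service.name": "llm-gateway"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```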
