The circuit breaker pattern, originally described by Michael Nygard in *Release It!*, prevents a failing downstream service from cascading failures into every upstream caller. For traditional microservices, the pattern is well understood. For LLM API calls, the same pattern applies, but with AI-specific parameters: inference latency is highly variable, timeouts are much longer, and the cost of failed calls includes token charges for requests that returned an error.
The Three States
- **Closed** — requests flow normally; failures are counted
- **Open** — all requests fail immediately without hitting the LLM; a retry timer is running
- **Half-Open** — one request is allowed through to test recovery; success closes the circuit, failure reopens it
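The transitions are simple enough to sketch directly. The following is a minimal, illustrative state machine, not the G8KEPR implementation; every name, default, and structural choice here is an assumption for demonstration purposes.

```python
import time

class SimpleCircuitBreaker:
    """Minimal three-state circuit breaker. Illustrative sketch only;
    names and defaults are assumptions, not a real library's API."""

    def __init__(self, failure_threshold=5, success_threshold=2, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and time.monotonic() - self.opened_at >= self.reset_timeout:
            self.state = "half_open"      # retry timer expired: permit a probe
            self.successes = 0
        if self.state == "half_open" and not self.probe_in_flight:
            self.probe_in_flight = True   # one probe at a time
            return True
        return False                      # open, or a probe is already running

    def record_success(self) -> None:
        self.probe_in_flight = False
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"     # recovered: resume normal traffic
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        self.probe_in_flight = False
        if self.state == "half_open":
            self.state = "open"           # probe failed: reopen immediately
            self.opened_at = time.monotonic()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```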
AI-Specific Configuration
Timeout calibration
LLM responses are slow compared to microservice calls: a p99 response time of 30 seconds is normal for large-context requests. Set your circuit breaker timeout based on the observed p99 latency for your specific request type, not on a generic 5-second microservice timeout that will trip constantly.
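One way to keep the timeout tied to observed latency is to derive it from a sliding window of recent samples. This is a sketch under assumed numbers (window size, minimum sample count, 20% headroom); the helper names are hypothetical.

```python
import asyncio
import statistics
import time
from collections import deque

latencies: deque[float] = deque(maxlen=1000)  # sliding window of observed call latencies (seconds)

def adaptive_timeout(default: float = 45.0, headroom: float = 1.2) -> float:
    # Use a generous default until the window has enough samples,
    # then set the timeout to the observed p99 plus 20% headroom.
    if len(latencies) < 100:
        return default
    p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
    return p99 * headroom

async def timed_llm_call(coro):
    start = time.monotonic()
    try:
        return await asyncio.wait_for(coro, timeout=adaptive_timeout())
    finally:
        latencies.append(time.monotonic() - start)
```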
Fallback strategy
When the circuit is open, what does your application do? Options: return a cached response, route to a smaller/cheaper model, return a graceful degradation response ('I'm currently unavailable — try again in a moment'), or queue the request for retry. Define the fallback before you need it.
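As a concrete shape for that decision, here is one possible fallback chain. `CircuitOpenError`, `cache`, and `call_small_model` are illustrative placeholders, not any specific library's API.

```python
async def answer(prompt: str) -> str:
    # Fallback chain for when the primary circuit is open. All names here
    # (CircuitOpenError, cache, call_small_model) are hypothetical.
    try:
        return await call_llm(prompt)              # primary model, behind the breaker
    except CircuitOpenError:
        cached = cache.get(prompt)
        if cached is not None:
            return cached                          # 1. serve a cached response
        try:
            return await call_small_model(prompt)  # 2. route to a smaller/cheaper model
        except Exception:
            # 3. graceful degradation as the last resort
            return "I'm currently unavailable. Please try again in a moment."
```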
Cost-aware tripping
Standard circuit breakers trip on error rate and latency. For AI APIs, also consider tripping when the cost rate exceeds a budget threshold — a runaway inference loop that generates 1,000 requests per minute at $0.01 each should trip the circuit regardless of the error rate.
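A sliding-window spend tracker is enough to implement this trip condition. The sketch below assumes the per-call cost is known when it is recorded; the class name and threshold are illustrative, not a specific library's API.

```python
import time
from collections import deque

class CostRateTracker:
    # Tracks (timestamp, cost) pairs over a trailing 60-second window.
    def __init__(self, limit_per_minute: float = 5.00):
        self.limit = limit_per_minute
        self.window: deque[tuple[float, float]] = deque()

    def record(self, cost: float) -> None:
        now = time.monotonic()
        self.window.append((now, cost))
        while self.window and now - self.window[0][0] > 60.0:
            self.window.popleft()  # evict entries older than one minute

    def over_budget(self) -> bool:
        # Trip signal: spend in the trailing minute exceeds the budget,
        # regardless of whether any individual call failed.
        return sum(cost for _, cost in self.window) > self.limit
```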
```python
from g8kepr import CircuitBreaker

llm_circuit = CircuitBreaker(
    failure_threshold=5,         # Open after 5 failures in window
    success_threshold=2,         # Close after 2 successes in half-open
    timeout=60,                  # Wait 60s before half-open probe
    call_timeout=45.0,           # Individual call timeout (p99 LLM latency)
    cost_limit_per_minute=5.00,  # Trip if spend exceeds $5/min
)

@llm_circuit
async def call_llm(prompt: str) -> str:
    return await anthropic_client.messages.create(...)
```
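With the decorator applied, every call to `call_llm` passes through the breaker first: failures and per-call spend are counted while the circuit is closed, calls fail fast while it is open, and the 45-second `call_timeout` reflects the p99-based calibration discussed above rather than a generic microservice default.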