The circuit breaker pattern, originally described by Michael Nygard in *Release It!*, prevents a failing downstream service from cascading failures into every upstream caller. For traditional microservices, the pattern is well understood. For LLM API calls, the same pattern applies, but with AI-specific parameters: inference latency is highly variable, timeouts are much longer, and the cost of failed calls includes token charges for requests that returned an error.
The Three States
- **Closed** — requests flow normally; failures are counted
- **Open** — all requests fail immediately without hitting the LLM; a retry timer is running
- **Half-Open** — one request is allowed through to test recovery; success closes the circuit, failure reopens it
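The transitions are simple enough to sketch directly. The following is a minimal, illustrative state machine, not the G8KEPR implementation; every name, default, and structural choice here is an assumption for demonstration purposes.

```python
import time

class SimpleCircuitBreaker:
    """Minimal three-state circuit breaker. Illustrative sketch only;
    names and defaults are assumptions, not a real library's API."""

    def __init__(self, failure_threshold=5, success_threshold=2, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and time.monotonic() - self.opened_at >= self.reset_timeout:
            self.state = "half_open"      # retry timer expired: permit a probe
            self.successes = 0
        if self.state == "half_open" and not self.probe_in_flight:
            self.probe_in_flight = True   # one probe at a time
            return True
        return False                      # open, or a probe is already running

    def record_success(self) -> None:
        self.probe_in_flight = False
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"     # recovered: resume normal traffic
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        self.probe_in_flight = False
        if self.state == "half_open":
            self.state = "open"           # probe failed: reopen immediately
            self.opened_at = time.monotonic()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```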
AI-Specific Configuration
Timeout calibration
LLM responses are slow compared to microservice calls: a p99 response time of 30 seconds is normal for large-context requests. Set your circuit breaker timeout based on the observed p99 latency for your specific request type, not on a generic 5-second microservice timeout that will trip constantly.
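One way to keep the timeout tied to observed latency is to derive it from a sliding window of recent samples. This is a sketch under assumed numbers (window size, minimum sample count, 20% headroom); the helper names are hypothetical.

```python
import asyncio
import statistics
import time
from collections import deque

latencies: deque[float] = deque(maxlen=1000)  # sliding window of observed call latencies (seconds)

def adaptive_timeout(default: float = 45.0, headroom: float = 1.2) -> float:
    # Use a generous default until the window has enough samples,
    # then set the timeout to the observed p99 plus 20% headroom.
    if len(latencies) < 100:
        return default
    p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
    return p99 * headroom

async def timed_llm_call(coro):
    start = time.monotonic()
    try:
        return await asyncio.wait_for(coro, timeout=adaptive_timeout())
    finally:
        latencies.append(time.monotonic() - start)
```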
Fallback strategy
When the circuit is open, what does your application do? Options: return a cached response, route to a smaller/cheaper model, return a graceful degradation response ('I'm currently unavailable — try again in a moment'), or queue the request for retry. Define the fallback before you need it.
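As a concrete shape for that decision, here is one possible fallback chain. `CircuitOpenError`, `cache`, and `call_small_model` are illustrative placeholders, not any specific library's API.

```python
async def answer(prompt: str) -> str:
    # Fallback chain for when the primary circuit is open. All names here
    # (CircuitOpenError, cache, call_small_model) are hypothetical.
    try:
        return await call_llm(prompt)              # primary model, behind the breaker
    except CircuitOpenError:
        cached = cache.get(prompt)
        if cached is not None:
            return cached                          # 1. serve a cached response
        try:
            return await call_small_model(prompt)  # 2. route to a smaller/cheaper model
        except Exception:
            # 3. graceful degradation as the last resort
            return "I'm currently unavailable. Please try again in a moment."
```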
Cost-aware tripping
Standard circuit breakers trip on error rate and latency. For AI APIs, also consider tripping when the cost rate exceeds a budget threshold — a runaway inference loop that generates 1,000 requests per minute at $0.01 each should trip the circuit regardless of the error rate.
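A sliding-window spend tracker is enough to implement this trip condition. The sketch below assumes the per-call cost is known when it is recorded; the class name and threshold are illustrative, not a specific library's API.

```python
import time
from collections import deque

class CostRateTracker:
    # Tracks (timestamp, cost) pairs over a trailing 60-second window.
    def __init__(self, limit_per_minute: float = 5.00):
        self.limit = limit_per_minute
        self.window: deque[tuple[float, float]] = deque()

    def record(self, cost: float) -> None:
        now = time.monotonic()
        self.window.append((now, cost))
        while self.window and now - self.window[0][0] > 60.0:
            self.window.popleft()  # evict entries older than one minute

    def over_budget(self) -> bool:
        # Trip signal: spend in the trailing minute exceeds the budget,
        # regardless of whether any individual call failed.
        return sum(cost for _, cost in self.window) > self.limit
```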
```python
from g8kepr import CircuitBreaker

llm_circuit = CircuitBreaker(
    failure_threshold=5,         # Open after 5 failures in window
    success_threshold=2,         # Close after 2 successes in half-open
    timeout=60,                  # Wait 60s before half-open probe
    call_timeout=45.0,           # Individual call timeout (p99 LLM latency)
    cost_limit_per_minute=5.00,  # Trip if spend exceeds $5/min
)

@llm_circuit
async def call_llm(prompt: str) -> str:
    return await anthropic_client.messages.create(...)
```
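With the decorator applied, every call to `call_llm` passes through the breaker first: failures and per-call spend are counted while the circuit is closed, calls fail fast while it is open, and the 45-second `call_timeout` reflects the p99-based calibration discussed above rather than a generic microservice default.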