Standard API rate limiting counts requests per time window. This is correct for REST APIs where requests have approximately uniform cost. It is wrong for AI APIs where a single request can consume anywhere from 10 to 200,000 tokens and cost proportionally.
The Three Algorithms
Token bucket
A bucket fills at a fixed rate and drains with each request. This allows bursting up to the bucket's capacity, then enforces the fill rate, which makes it a good fit for APIs where brief bursts are acceptable. The traditional implementation counts requests; for AI APIs, each "token" in the bucket should represent a unit of compute, not a request.
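A minimal sketch of that adaptation, assuming the per-request cost in model tokens is known up front; the class and parameter names are illustrative, not any particular library's API:

```python
import time


class TokenBucket:
    """Token bucket whose capacity is measured in model tokens, not requests."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # maximum burst size, in model tokens
        self.refill_rate = refill_rate    # model tokens added per second
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, cost: int) -> bool:
        """Attempt to spend `cost` model tokens; False means reject the request."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: 100K-token burst capacity, refilling at roughly 100K tokens per hour.
bucket = TokenBucket(capacity=100_000, refill_rate=100_000 / 3600)
if not bucket.try_consume(cost=4_096):
    print("429: token budget exceeded")
```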
Sliding window
Tracks requests in a rolling time window. More accurate than fixed windows, because it avoids the "double limit" problem at window boundaries: with fixed windows, a client can spend a full allotment at the end of one window and another full allotment at the start of the next. The trade-off is that it is more memory-intensive. Again, it must be adapted to count tokens rather than requests for AI endpoints.
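A sketch of a token-counting sliding window, again with illustrative names; each recorded event is a timestamp plus the token cost of one request:

```python
import time
from collections import deque


class SlidingWindowTokenLimiter:
    """Sliding-window limiter that sums model tokens rather than request counts."""

    def __init__(self, max_tokens: int, window_seconds: float):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, token_count) pairs, oldest first
        self.total = 0          # running sum of tokens inside the window

    def _evict_expired(self) -> None:
        cutoff = time.monotonic() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            _, tokens = self.events.popleft()
            self.total -= tokens

    def try_consume(self, tokens: int) -> bool:
        """Record `tokens` if the rolling-window total stays under the limit."""
        self._evict_expired()
        if self.total + tokens > self.max_tokens:
            return False
        self.events.append((time.monotonic(), tokens))
        self.total += tokens
        return True


limiter = SlidingWindowTokenLimiter(max_tokens=500_000, window_seconds=3600)
allowed = limiter.try_consume(tokens=12_000)
```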
Token budget (AI-specific)
The correct abstraction for AI APIs is a token budget: a per-session, per-key, or per-org limit on total tokens consumed in a time period. A budget of 500K tokens per hour is meaningful regardless of whether that was 5 large requests or 5,000 small ones.
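A minimal per-key budget might look like the following; the function name is hypothetical, and in practice the usage map would live in shared storage such as Redis rather than process memory:

```python
import time
from collections import defaultdict

HOURLY_BUDGET = 500_000   # model tokens per API key per hour

# Tokens consumed, keyed by (api_key, hour bucket).
_usage: dict[tuple[str, int], int] = defaultdict(int)


def charge_tokens(api_key: str, tokens: int) -> bool:
    """Deduct `tokens` from the key's hourly budget; False means reject."""
    key = (api_key, int(time.time() // 3600))
    if _usage[key] + tokens > HOURLY_BUDGET:
        return False
    _usage[key] += tokens
    return True
```

The same check applies whether the hour's 500K tokens arrive as 5 large requests or 5,000 small ones.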
Implementing Token Budget Rate Limiting
The challenge: you do not know a request's total token count until the model responds, and even the input token count requires tokenizing the request before forwarding it. A practical approach: count input tokens before forwarding, estimate output tokens from the max_tokens parameter, deduct the full estimate from the budget pessimistically, then refund the unused portion of the estimate once the response arrives.
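A sketch of that flow, assuming a hypothetical budget store with consume and refund operations; `count_tokens` and `call_model` are placeholders for your real tokenizer and upstream model client:

```python
import time
from dataclasses import dataclass


class TokenBudget:
    """Per-key hourly token budget with refunds for over-estimated requests."""

    def __init__(self, hourly_limit: int):
        self.hourly_limit = hourly_limit
        self.used: dict[tuple[str, int], int] = {}

    def _key(self, api_key: str) -> tuple[str, int]:
        return (api_key, int(time.time() // 3600))

    def try_consume(self, api_key: str, tokens: int) -> bool:
        key = self._key(api_key)
        if self.used.get(key, 0) + tokens > self.hourly_limit:
            return False
        self.used[key] = self.used.get(key, 0) + tokens
        return True

    def refund(self, api_key: str, tokens: int) -> None:
        key = self._key(api_key)
        self.used[key] = max(0, self.used.get(key, 0) - tokens)


@dataclass
class Completion:
    text: str
    output_tokens: int


def count_tokens(text: str) -> int:
    # Placeholder: use the model's real tokenizer here.
    return max(1, len(text) // 4)


def call_model(prompt: str, max_tokens: int) -> Completion:
    # Placeholder for the upstream model call.
    return Completion(text="...", output_tokens=max_tokens // 2)


budget = TokenBudget(hourly_limit=500_000)


def handle_completion(api_key: str, prompt: str, max_tokens: int) -> Completion:
    input_tokens = count_tokens(prompt)           # count input before forwarding
    estimate = input_tokens + max_tokens          # assume worst-case output

    if not budget.try_consume(api_key, estimate): # deduct pessimistically
        raise RuntimeError("429: token budget exceeded")

    response = call_model(prompt, max_tokens=max_tokens)

    # Refund whatever portion of the estimate the model did not actually use.
    actual = input_tokens + response.output_tokens
    budget.refund(api_key, estimate - actual)
    return response
```

The pessimistic deduction means concurrent requests cannot collectively overshoot the budget; the refund keeps well-behaved clients from being penalized for setting a generous max_tokens.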
Why This Matters for Security
Without token-aware rate limiting, a single authenticated user can exhaust your API budget (and, depending on your pricing model, run up significant cost) with a handful of requests. This is a denial-of-service vector that request-count rate limiting does not catch at all. It is also how prompt injection attacks amplify their cost impact.
