Standard API rate limiting counts requests per time window. This is correct for REST APIs where requests have approximately uniform cost. It is wrong for AI APIs where a single request can consume anywhere from 10 to 200,000 tokens and cost proportionally.
The Three Algorithms
Token bucket
A bucket fills at a fixed rate and drains with each request. This allows bursting up to the bucket's capacity, then enforces the fill rate, which makes it a good fit for APIs where brief bursts are acceptable. The traditional implementation counts requests; for AI APIs, each "token" in the bucket should represent a unit of compute, not a request.
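A minimal sketch of that adaptation, assuming the per-request cost in model tokens is known up front; the class and parameter names are illustrative, not any particular library's API:

```python
import time


class TokenBucket:
    """Token bucket whose capacity is measured in model tokens, not requests."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # maximum burst size, in model tokens
        self.refill_rate = refill_rate    # model tokens added per second
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, cost: int) -> bool:
        """Attempt to spend `cost` model tokens; False means reject the request."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: 100K-token burst capacity, refilling at roughly 100K tokens per hour.
bucket = TokenBucket(capacity=100_000, refill_rate=100_000 / 3600)
if not bucket.try_consume(cost=4_096):
    print("429: token budget exceeded")
```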
Sliding window
Tracks requests in a rolling time window. More accurate than fixed windows, because it avoids the "double limit" problem at window boundaries: with fixed windows, a client can spend a full allotment at the end of one window and another full allotment at the start of the next. The trade-off is that it is more memory-intensive. Again, it must be adapted to count tokens rather than requests for AI endpoints.
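A sketch of a token-counting sliding window, again with illustrative names; each recorded event is a timestamp plus the token cost of one request:

```python
import time
from collections import deque


class SlidingWindowTokenLimiter:
    """Sliding-window limiter that sums model tokens rather than request counts."""

    def __init__(self, max_tokens: int, window_seconds: float):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, token_count) pairs, oldest first
        self.total = 0          # running sum of tokens inside the window

    def _evict_expired(self) -> None:
        cutoff = time.monotonic() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            _, tokens = self.events.popleft()
            self.total -= tokens

    def try_consume(self, tokens: int) -> bool:
        """Record `tokens` if the rolling-window total stays under the limit."""
        self._evict_expired()
        if self.total + tokens > self.max_tokens:
            return False
        self.events.append((time.monotonic(), tokens))
        self.total += tokens
        return True


limiter = SlidingWindowTokenLimiter(max_tokens=500_000, window_seconds=3600)
allowed = limiter.try_consume(tokens=12_000)
```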
Token budget (AI-specific)
The correct abstraction for AI APIs is a token budget: a per-session, per-key, or per-org limit on total tokens consumed in a time period. A budget of 500K tokens per hour is meaningful regardless of whether that was 5 large requests or 5,000 small ones.
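A minimal per-key budget might look like the following; the function name is hypothetical, and in practice the usage map would live in shared storage such as Redis rather than process memory:

```python
import time
from collections import defaultdict

HOURLY_BUDGET = 500_000   # model tokens per API key per hour

# Tokens consumed, keyed by (api_key, hour bucket).
_usage: dict[tuple[str, int], int] = defaultdict(int)


def charge_tokens(api_key: str, tokens: int) -> bool:
    """Deduct `tokens` from the key's hourly budget; False means reject."""
    key = (api_key, int(time.time() // 3600))
    if _usage[key] + tokens > HOURLY_BUDGET:
        return False
    _usage[key] += tokens
    return True
```

The same check applies whether the hour's 500K tokens arrive as 5 large requests or 5,000 small ones.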
Implementing Token Budget Rate Limiting
The challenge: you do not know a request's total token count until the model responds, and even the input token count requires tokenizing the request before forwarding it. A practical approach: count input tokens before forwarding, estimate output tokens from the max_tokens parameter, deduct the full estimate from the budget pessimistically, then refund the unused portion of the estimate once the response arrives.
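A sketch of that flow, assuming a hypothetical budget store with consume and refund operations; `count_tokens` and `call_model` are placeholders for your real tokenizer and upstream model client:

```python
import time
from dataclasses import dataclass


class TokenBudget:
    """Per-key hourly token budget with refunds for over-estimated requests."""

    def __init__(self, hourly_limit: int):
        self.hourly_limit = hourly_limit
        self.used: dict[tuple[str, int], int] = {}

    def _key(self, api_key: str) -> tuple[str, int]:
        return (api_key, int(time.time() // 3600))

    def try_consume(self, api_key: str, tokens: int) -> bool:
        key = self._key(api_key)
        if self.used.get(key, 0) + tokens > self.hourly_limit:
            return False
        self.used[key] = self.used.get(key, 0) + tokens
        return True

    def refund(self, api_key: str, tokens: int) -> None:
        key = self._key(api_key)
        self.used[key] = max(0, self.used.get(key, 0) - tokens)


@dataclass
class Completion:
    text: str
    output_tokens: int


def count_tokens(text: str) -> int:
    # Placeholder: use the model's real tokenizer here.
    return max(1, len(text) // 4)


def call_model(prompt: str, max_tokens: int) -> Completion:
    # Placeholder for the upstream model call.
    return Completion(text="...", output_tokens=max_tokens // 2)


budget = TokenBudget(hourly_limit=500_000)


def handle_completion(api_key: str, prompt: str, max_tokens: int) -> Completion:
    input_tokens = count_tokens(prompt)           # count input before forwarding
    estimate = input_tokens + max_tokens          # assume worst-case output

    if not budget.try_consume(api_key, estimate): # deduct pessimistically
        raise RuntimeError("429: token budget exceeded")

    response = call_model(prompt, max_tokens=max_tokens)

    # Refund whatever portion of the estimate the model did not actually use.
    actual = input_tokens + response.output_tokens
    budget.refund(api_key, estimate - actual)
    return response
```

The pessimistic deduction means concurrent requests cannot collectively overshoot the budget; the refund keeps well-behaved clients from being penalized for setting a generous max_tokens.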
Why This Matters for Security
Without token-aware rate limiting, a single authenticated user can exhaust your API budget (and, depending on your pricing model, run up significant cost) with a handful of requests. This is a denial-of-service vector that request-count rate limiting does not catch at all. It is also how prompt injection attacks amplify their cost impact.
