Rate Limiting for AI APIs: Token Bucket vs Sliding Window vs Token Budget — G8KEPR Blog
Architecture · 7 min read · February 28, 2026

Rate Limiting for AI APIs: Token Bucket vs Sliding Window vs Token Budget

Traditional API rate limiting counts requests. AI APIs need to count tokens. A single malicious request that consumes 100K tokens in one call is not caught by a "100 requests per minute" rule. Here is how to rate limit AI endpoints correctly.

Standard API rate limiting counts requests per time window. This is correct for REST APIs, where requests have approximately uniform cost. It is wrong for AI APIs, where a single request can consume anywhere from 10 to 200,000 tokens and cost proportionally.

The Three Algorithms

Token bucket

A bucket fills at a fixed rate and drains with each request. Allows bursting up to bucket capacity, then enforces the fill rate. Best for APIs where brief bursts are acceptable. Traditional implementation counts requests — for AI APIs, each "token" in the bucket should represent a unit of compute, not a request.
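A token-weighted bucket can be sketched as follows. This is a minimal illustration, not a production implementation; the class name, capacity, and refill rate are assumptions chosen for the example:

```python
import time

class TokenBucket:
    """Token bucket where capacity is measured in model tokens, not requests.
    capacity_tokens and refill_per_sec are illustrative values."""

    def __init__(self, capacity_tokens: int, refill_per_sec: float):
        self.capacity = capacity_tokens
        self.refill_per_sec = refill_per_sec
        self.level = float(capacity_tokens)  # start full: allows an initial burst
        self.last = time.monotonic()

    def allow(self, cost_tokens: int) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.level = min(self.capacity,
                         self.level + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.level >= cost_tokens:
            self.level -= cost_tokens
            return True
        return False

bucket = TokenBucket(capacity_tokens=100_000, refill_per_sec=500)
print(bucket.allow(60_000))  # True: within burst capacity
print(bucket.allow(60_000))  # False: bucket nearly drained
```

Note that `allow` is charged per request with that request's token cost, so one 100K-token call drains the bucket just as fast as a thousand 100-token calls.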

Sliding window

Tracks requests in a rolling time window. More accurate than fixed windows (avoids the "double limit" problem at window boundaries) but more memory-intensive. Again, must be adapted to count tokens rather than requests for AI endpoints.
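A token-summing sliding window might look like this sketch (class name and limits are illustrative). Each event stores its timestamp and token cost, which is what makes it more memory-intensive than a fixed window:

```python
import time
from collections import deque

class SlidingWindowTokenLimiter:
    """Rolling-window limiter that sums token usage rather than counting
    requests. Limits here are illustrative."""

    def __init__(self, max_tokens: int, window_sec: float):
        self.max_tokens = max_tokens
        self.window_sec = window_sec
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)
        self.total = 0

    def allow(self, cost_tokens: int) -> bool:
        now = time.monotonic()
        # Evict events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] > self.window_sec:
            _, old_cost = self.events.popleft()
            self.total -= old_cost
        if self.total + cost_tokens > self.max_tokens:
            return False
        self.events.append((now, cost_tokens))
        self.total += cost_tokens
        return True
```

Because the window rolls continuously, a client cannot spend a full limit at 0:59 and another full limit at 1:01, which is the "double limit" failure mode of fixed windows.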

Token budget (AI-specific)

The correct abstraction for AI APIs is a token budget: a per-session, per-key, or per-org limit on total tokens consumed in a time period. A budget of 500K tokens per hour is meaningful regardless of whether that was 5 large requests or 5,000 small ones.
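A per-key token budget can be sketched as below, using the article's 500K tokens/hour figure. The class and method names are hypothetical; a real deployment would back this with shared storage (e.g. Redis) rather than process memory:

```python
import time

class TokenBudget:
    """Per-key token budget over a fixed period. The key can be a
    session ID, API key, or org ID."""

    def __init__(self, budget_tokens: int, period_sec: float):
        self.budget = budget_tokens
        self.period = period_sec
        self.usage: dict[str, tuple[float, int]] = {}  # key -> (period_start, spent)

    def charge(self, key: str, cost_tokens: int) -> bool:
        now = time.monotonic()
        start, spent = self.usage.get(key, (now, 0))
        if now - start > self.period:   # period elapsed: reset the meter
            start, spent = now, 0
        if spent + cost_tokens > self.budget:
            self.usage[key] = (start, spent)
            return False
        self.usage[key] = (start, spent + cost_tokens)
        return True

budget = TokenBudget(budget_tokens=500_000, period_sec=3600)
```

The same budget rejects 5 requests of 150K tokens each just as it rejects 5,000 requests of 150 tokens each once the 500K ceiling is hit.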

Implementing Token Budget Rate Limiting

The challenge: you do not know the token count of a request until the model responds (for completions), and even the input count requires tokenizing the prompt before forwarding. The practical approach: count input tokens before forwarding; estimate output tokens from the max_tokens parameter; deduct the full estimate from the budget pessimistically; refund the unused portion after the response arrives.
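The reserve-then-refund flow can be sketched like this. Everything here is a simplified assumption: `reserve`, `refund`, the in-memory `spent` dict, and `call_model` are hypothetical stand-ins for budget storage and the upstream API call:

```python
# In-memory stand-in for a shared budget store (e.g. Redis in production).
spent: dict[str, int] = {}
BUDGET = 500_000  # tokens per period, matching the article's example

def reserve(key: str, tokens: int) -> bool:
    """Atomically-in-spirit deduct tokens; reject if it would exceed the budget."""
    if spent.get(key, 0) + tokens > BUDGET:
        return False
    spent[key] = spent.get(key, 0) + tokens
    return True

def refund(key: str, tokens: int) -> None:
    """Credit back an over-reserved estimate."""
    spent[key] = max(0, spent.get(key, 0) - tokens)

def handle_request(key: str, prompt_tokens: int, max_tokens: int, call_model):
    # Pessimistic: reserve input + worst-case output before forwarding.
    estimate = prompt_tokens + max_tokens
    if not reserve(key, estimate):
        raise RuntimeError("429: token budget exceeded")
    completion_tokens = call_model()          # actual completion length
    refund(key, max_tokens - completion_tokens)  # return the unused estimate
    return completion_tokens
```

The pessimistic reserve means a burst of concurrent requests cannot collectively overshoot the budget; the refund keeps short completions from being billed at their max_tokens ceiling.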

Why This Matters for Security

Without token-aware rate limiting, a single authenticated user can exhaust your API budget — and depending on your pricing model, generate significant cost — with a handful of requests. This is a denial-of-service vector that bypasses traditional rate limiting completely. It is also how prompt injection attacks can amplify their cost impact.

