CI Gate: p95 < 200ms @ 50 VU

Performance
Benchmarks

We publish our performance targets, test methodology, and CI gate thresholds. Every pull request must pass p95 < 200ms at 50 VU before it can merge. Here is exactly what G8KEPR adds to your request path — and why it stays flat as load scales.

p95 CI gate: < 200ms
Precision: 100%
Pass rate under 50-VU load: 99.71%
Tier-1 regex: < 1ms

Context matters. The CI gate runs at 50 VU on the demo stack (single-region cloud · 2 vCPU · 4GB RAM). All test scripts are in the repo — run them yourself.

Demo (2 vCPU · 4GB): 50 VU · 100+ RPS stable
Production (4 vCPU · 8GB): ~500 VU · 250+ RPS
Kubernetes horizontal: 2,000+ RPS (3-10 replicas, HPA 70% CPU)

Real measurements, not aspirational targets: every number below ties to a checked-in CI gate or a documented test run.

Tier-1 regex (fast-path overhead): < 1ms
Fast-path p99 (regex-only path): < 25ms
p99 with ML (embeddings + NLI): < 40ms
CI precision (241-sample corpus): 100%
Pass under load (2,424 concurrent requests): 99.71%
Redis caches (TTL + fail-safe): 11
RPS production (4 vCPU baseline): 250+
RPS at scale (K8s 3-10 replicas): 2,000+
Enforced on Every Pull Request

CI Performance Gate

These thresholds are defined in tests/load/k6-ci.js and block merges if any threshold fails.

Smoke (1 VU · 30s)

Sanity check — single-user baseline, tightest thresholds

p95 < 100ms · p99 < 200ms · Throughput > 50 req/s · Error rate < 0.5%
Load (50 VU · 3 min)

Authoritative baseline — must pass to merge to main

p95 < 200ms · p99 < 500ms · Throughput > 100 req/s · Error rate < 1%
Stress (200 VU · 1 min)

Burst capacity — proves the system holds under spike load

p95 < 500ms · p99 < 1,000ms · Throughput > 150 req/s · Error rate < 5%
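
The three scenarios above amount to a small threshold table. The sketch below shows one way such a gate could be evaluated in Python. The real thresholds live in tests/load/k6-ci.js; the `Thresholds` class, the `check_gate` function, and the sample measurements in the usage note are hypothetical illustrations, not the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    p95_ms: float          # p95 latency must stay below this
    p99_ms: float          # p99 latency must stay below this
    min_rps: float         # throughput must stay above this
    max_error_rate: float  # error fraction must stay below this


# Values copied from the scenario descriptions above.
SCENARIOS = {
    "smoke": Thresholds(100, 200, 50, 0.005),
    "load": Thresholds(200, 500, 100, 0.01),
    "stress": Thresholds(500, 1000, 150, 0.05),
}


def check_gate(scenario: str, p95_ms: float, p99_ms: float,
               rps: float, error_rate: float) -> list[str]:
    """Return the names of failed thresholds; an empty list means the gate passes."""
    t = SCENARIOS[scenario]
    failures = []
    if not p95_ms < t.p95_ms:
        failures.append("p95")
    if not p99_ms < t.p99_ms:
        failures.append("p99")
    if not rps > t.min_rps:
        failures.append("throughput")
    if not error_rate < t.max_error_rate:
        failures.append("error_rate")
    return failures
```

With made-up measurements, `check_gate("load", 180, 450, 110, 0.002)` returns an empty list, while `check_gate("load", 250, 450, 90, 0.0)` reports the p95 and throughput thresholds as failed.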

Per-Endpoint Performance Targets

Published targets from tests/load/benchmark_results.json

Endpoint · Description · Category · p50 target · p95 target
GET /health · Readiness probe · Infrastructure · 50ms · 100ms
POST /api/keys/validate · API key authentication · Auth · 50ms · 200ms
GET /api/gateway/proxy · Gateway passthrough · Gateway · 80ms · 200ms
GET /api/compliance/frameworks · Compliance data fetch · Compliance · 150ms · 500ms
GET /api/audit-logs · Audit log retrieval · Audit · 200ms · 500ms
POST /api/threat-intelligence/analyze · Full AI threat scan · AI Analysis · 300ms · 1000ms

Threat detection p95 is higher because it runs optional ML analysis. It is async and does not block the gateway response.

Cascading Detection · Cheap Tiers First

Threat Detection Latency by Tier

The detection engine is a four-tier cascade. Cheap tiers run first; expensive tiers run only when the cheaper tiers return low confidence. The regex-only fast path stays under 25ms at p99.

T1 · Regex Pattern Matching · < 1 ms

~20 pre-compiled signatures. SQLi, XSS, path traversal, command chaining. High-confidence (>0.9) hits block immediately.

T2 · Embedding Similarity · 5-15 ms

SentenceTransformer cosine similarity vs. known injection vectors. Catches semantic equivalents that bypass literal regex. Lazy-loaded model.

T3 · NLI Zero-Shot Classification · < 15 ms

Intent classification gate, plus base64 / URL / hex decode-then-rescan, NFKD normalization, character-split detection, and Shannon entropy checks.

T4 · LLM Escalation (rare) · 200-2,000 ms

High-uncertainty inputs only. Routed via the AI Gateway to an LLM provider. Runs async and fails open on timeout so it never blocks the response.

Fast path p99

< 25ms regex-only · < 40ms with ML scoring

Detection accuracy

100% precision · 87.96% recall · 0 false positives on 241-sample corpus
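
The Tier 3 preprocessing listed above (decode-then-rescan, NFKD normalization, Shannon entropy) can be sketched with the Python standard library alone. This is a minimal illustration, not the engine's implementation: the function names `shannon_entropy` and `candidate_decodings` are hypothetical, and the real detector may decode recursively or apply thresholds not shown here.

```python
import base64
import math
import unicodedata
from collections import Counter
from urllib.parse import unquote


def shannon_entropy(text: str) -> float:
    """Bits per character; high values can indicate encoded or obfuscated payloads."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def candidate_decodings(text: str) -> list[str]:
    """Return the NFKD-normalized input plus every URL/base64/hex decoding that succeeds.

    A caller would re-run its pattern scan over each returned string.
    """
    candidates = [unicodedata.normalize("NFKD", text)]
    url_decoded = unquote(text)
    if url_decoded != text:
        candidates.append(url_decoded)
    try:
        candidates.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except ValueError:  # also covers binascii.Error and UnicodeDecodeError
        pass
    try:
        candidates.append(bytes.fromhex(text).decode("utf-8"))
    except ValueError:
        pass
    return candidates
```

For example, `candidate_decodings("%3Cscript%3E")` includes the URL-decoded form `<script>`, which a literal regex over the raw input would have missed.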

What G8KEPR Does With Those Milliseconds

The Request Pipeline

Every request goes through only the steps it needs. Pass-through validation adds ~15–30ms. Active threat scanning adds 50–200ms but runs async.

1 · JWT + API key validation · 1–3ms

In-memory JWT signature check + Redis GET for key scope

2 · Rate limit check · 2–5ms

Redis INCR with sliding window, single round-trip

3 · Circuit breaker evaluation · < 1ms

In-memory state check, no I/O

4 · Threat detection (cached) · 5–10ms

Redis semantic cache hit, no model call needed

4b · Threat detection (uncached) · 50–200ms

Pattern matching + optional ML scoring. Async, does not block the response.

5 · Audit log write · 2–8ms

Async PostgreSQL insert, does not add to request latency

Pass-through total: ~15–30ms added overhead for JWT validation + rate limit + circuit breaker. No more.
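
As an illustration of the rate-limit step, here is a minimal sliding-window counter in Python. It uses the common two-bucket approximation (the previous window's count is weighted by how much of it still overlaps the sliding window). An in-memory dict stands in for the Redis keys that the gateway updates with INCR, so this sketch does not reproduce the single round-trip; `SlidingWindowLimiter` and its parameters are hypothetical names.

```python
import time
from typing import Callable


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_s: int = 60,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # (client, window index) -> count; stand-in for Redis counters.
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, client: str) -> bool:
        now = self.clock()
        current = int(now // self.window_s)
        # Fraction of the current fixed window that has already elapsed.
        elapsed = (now % self.window_s) / self.window_s
        previous_count = self.counts.get((client, current - 1), 0)
        current_count = self.counts.get((client, current), 0)
        # Weight the previous window by the part still inside the sliding window.
        estimated = previous_count * (1 - elapsed) + current_count
        if estimated >= self.limit:
            return False
        self.counts[(client, current)] = current_count + 1  # the INCR
        return True
```

Injecting the clock keeps the limiter deterministic in tests; a production version would also expire old keys, which Redis handles with a TTL on each counter.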

Threat analysis: Runs async after the response is sent. Your users do not wait for the ML model.
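
The respond-first, analyze-later pattern described above can be sketched with plain asyncio. This is an assumption-laden illustration: `handle_request`, `analyze_threat`, and `analysis_log` are hypothetical names, and the sleep stands in for the uncached scan. A FastAPI application might use its BackgroundTasks facility instead of raw tasks.

```python
import asyncio

analysis_log: list[str] = []
_background: set[asyncio.Task] = set()


async def analyze_threat(payload: str) -> None:
    await asyncio.sleep(0.05)  # stands in for the 50-200ms uncached scan
    analysis_log.append(payload)


async def handle_request(payload: str) -> dict:
    # Schedule the scan without awaiting it, so the caller gets a response first.
    task = asyncio.create_task(analyze_threat(payload))
    _background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(_background.discard)
    return {"status": 200}


async def main() -> tuple[dict, bool, list[str]]:
    response = await handle_request("hello")
    scan_pending = len(analysis_log) == 0  # response returned before the scan finished
    await asyncio.gather(*_background)     # drain background work before shutdown
    return response, scan_pending, list(analysis_log)
```

Running `asyncio.run(main())` shows the response arriving while the scan is still pending, with the scan result recorded afterwards.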

Built for Throughput

Architecture Choices That Keep Overhead Low

Performance is a first-class design constraint, not an afterthought. These are the specific choices that give G8KEPR its overhead profile.

Async FastAPI

All endpoints are async/await — no thread blocking on I/O. Single worker can handle hundreds of concurrent connections without spawning threads.

Python asyncio · uvicorn · No GIL blocking

Redis Singleflight Cache

Identical concurrent requests to Redis are coalesced — only one cache miss fires even under burst. Semantic cache prevents re-running ML models for repeated prompt patterns.

Singleflight · Semantic dedup · < 10ms cache hit
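
Request coalescing of this kind is usually called singleflight. A minimal asyncio sketch follows; the `SingleFlight` class is a hypothetical illustration of the technique, not the gateway's code, and it coalesces only within one process.

```python
import asyncio
from typing import Any, Awaitable, Callable


class SingleFlight:
    """Coalesce concurrent calls for the same key into one underlying call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real work.
            task = asyncio.ensure_future(fn())
            self._inflight[key] = task
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)
```

Ten concurrent `do("k", load)` calls trigger a single `load()`; every caller receives the same result, and the key is cleared once the call completes so later requests load fresh data.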

Connection Pooling

PostgreSQL connections are pooled via asyncpg — no per-request connect overhead. Pool sizing is tuned per deployment based on DB capacity.

asyncpg pool · No reconnect overhead · Configurable size

Brotli + HTTP/3

API responses are Brotli-compressed at the edge, reducing wire bytes by ~30% vs gzip. HTTP/3 eliminates head-of-line blocking for clients that support it.

Brotli compression · HTTP/3 QUIC · ~30% smaller payloads
Where Generic Gateways Lose to G8KEPR Under Load

Performance Choices Built into the Architecture

These aren't tunables that need configuring after deployment — they're structural decisions that keep latency flat as load scales. None of them are present in a generic API gateway plus third-party AI guardrail stack.

4-Tier Cascading Detection

Cheap tiers run first. 99% of traffic terminates at Tier 1 regex (<1ms). Only ambiguous inputs escalate to embeddings (5-15ms), NLI (<15ms), or LLM (rare). p99 stays <25ms regex-only, <40ms with ML.

fast-path < 25ms · modules/threat_detection/
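
The cheap-first escalation logic can be sketched as a loop over scorers ordered by cost. Everything here is illustrative: the three patterns, the 0.5 "ambiguous" heuristic, the `clear_at` cutoff, and the fail-open fallthrough are assumptions made for the example; only the >0.9 block threshold comes from the tier description.

```python
import re
from typing import Callable

# Hypothetical Tier-1 signatures; the real engine ships ~20 of them.
TIER1_PATTERNS = [
    re.compile(r"(?i)\bunion\s+select\b"),  # SQL injection
    re.compile(r"(?i)<script\b"),           # XSS
    re.compile(r"\.\./"),                   # path traversal
]

Tier = tuple[str, Callable[[str], float]]  # (name, scorer returning confidence in [0, 1])


def tier1_regex(text: str) -> float:
    if any(p.search(text) for p in TIER1_PATTERNS):
        return 0.95  # signature hit: high confidence
    if re.search(r"(?i)\b(select|script)\b", text):
        return 0.5   # suspicious keyword alone: ambiguous, escalate
    return 0.0       # nothing suspicious: terminate here


def run_cascade(text: str, tiers: list[Tier],
                block_at: float = 0.9, clear_at: float = 0.1) -> tuple[str, str]:
    """Run tiers cheapest-first and stop at the first confident verdict."""
    for name, scorer in tiers:
        score = scorer(text)
        if score > block_at:
            return "block", name
        if score < clear_at:
            return "allow", name
    # No tier was confident; this sketch fails open.
    return "allow", "unresolved"
```

With a stub second tier, `run_cascade("please select a plan", [("regex", tier1_regex), ("embedding", lambda t: 0.05)])` escalates past the regex tier and is cleared by the embedding tier, while a `UNION SELECT` payload is blocked at the regex tier without running anything more expensive.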

Auto Anthropic Prompt Caching

SHA-256 fingerprint per system prompt. After 2 identical prompts, auto-injects cache_control: {"type": "ephemeral"}. Savings start at 2 calls (38% off) and reach 88% at 10 calls. Zero SDK changes.

88% savings · 10 calls · gateway/cache_optimizer.py
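
A minimal sketch of the fingerprint-and-inject idea, assuming "after 2 identical prompts" means the marker is added from the third occurrence onward. `PromptCacheOptimizer` is a hypothetical name, not the contents of gateway/cache_optimizer.py; the block shape follows Anthropic's Messages API, where a system content block may carry a `cache_control` field of type `ephemeral`.

```python
import hashlib
from collections import Counter


class PromptCacheOptimizer:
    def __init__(self, threshold: int = 2):
        self.threshold = threshold      # identical prompts seen before caching kicks in
        self.seen: Counter = Counter()  # fingerprint -> times seen

    def system_blocks(self, system_prompt: str) -> list[dict]:
        """Return the system prompt as content blocks, marked cacheable once it repeats."""
        fingerprint = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        self.seen[fingerprint] += 1
        block: dict = {"type": "text", "text": system_prompt}
        if self.seen[fingerprint] > self.threshold:
            block["cache_control"] = {"type": "ephemeral"}
        return [block]
```

Because the rewrite happens on the request body at the gateway, callers keep sending a plain system prompt and need no client changes.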

Adaptive Z-Score Breaker

Statistical baseline per provider per hour-of-day, not static thresholds. 4 rolling windows (1m / 5m / 15m / 1h). Progressive recovery 10/25/50/100%. Unlike the statically configured thresholds typical of Hystrix or Resilience4j setups, the trip point adapts to each provider's normal behavior.

3σ · 4 windows · gateway/router.py
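
The core of a z-score trigger is small enough to sketch. The example below tracks a single rolling window and flags a sample that sits more than 3 standard deviations above the window mean. It is a simplified illustration: the per-provider, per-hour baselines, the four windows, and the progressive recovery steps are not modeled, and `ZScoreDetector` is a hypothetical name.

```python
import statistics
from collections import deque


class ZScoreDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0, min_samples: int = 10):
        self.samples: deque = deque(maxlen=window)  # rolling baseline
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, latency_ms: float) -> bool:
        """Record a sample; return True if it is anomalous against the baseline so far."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            if stdev > 0:
                anomalous = (latency_ms - mean) / stdev > self.threshold
        self.samples.append(latency_ms)
        return anomalous
```

A provider whose latency normally hovers around 100ms trips on a 500ms sample, whereas the same 500ms would be unremarkable for a provider whose baseline is already that slow.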

Redis Singleflight Cache

Identical concurrent requests are coalesced — only one cache miss fires under burst. Semantic cache prevents re-running ML for repeated patterns. Cache miss is always safe; every cache has a documented fallback path.

burst-safe · 11 Redis caches · TTL fail-safe

Bounded LRU Memory

TrafficAnalyzer caches are size-capped (5K endpoints, 10K patterns) with auto-eviction on overflow. No memory leak vector — deliberately built to fail predictably under sustained load instead of OOM-spiraling.

5K + 10K caps · _BoundedDict · OrderedDict
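
A size-capped mapping built on OrderedDict can look like the sketch below, which follows the LRU recipe from the Python standard library documentation. It illustrates the technique only; the project's own _BoundedDict may differ in eviction policy and API.

```python
from collections import OrderedDict


class BoundedDict(OrderedDict):
    """Mapping that evicts the least recently used entry once max_size is exceeded."""

    def __init__(self, max_size: int):
        super().__init__()
        self.max_size = max_size

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # a read marks the entry as recently used
        return value

    def __setitem__(self, key, value):
        if key in self:
            self.move_to_end(key)
        super().__setitem__(key, value)
        if len(self) > self.max_size:
            oldest = next(iter(self))  # front of the order is least recently used
            del self[oldest]
```

With a cap of 5,000 endpoints, inserting endpoint 5,001 drops the one that has gone longest without a read or write, so memory stays bounded no matter how many distinct endpoints are observed.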
Verified Measurements · 2026-04-19 nightly replay

Actual Numbers, Not Targets

Below are real numbers from the documented test run, not estimates. Tests run in CI every night against the demo stack.

Single-user baseline
GET /health (readiness probe): avg 9.7ms · p95 10ms
POST /auth/login (legitimate request): avg 11.7ms · p95 15.9ms
POST /auth/login with SQLi (threat detected, blocked with 403): avg 11.2ms · p95 12.5ms

Threat detection adds zero measurable latency vs. legitimate requests at single-user load (compare row 2 to row 3).

50-user sustained load
Threat detection: 103.6 RPS · p50 130ms · p95 350ms · p99 1.3s
Dashboard APIs: p50 9ms · p95 19ms · p99 47ms
SQLi block rate: 100% · 877 / 877 blocked as 403

Threat-analysis under load

99.71% pass rate across 2,424 concurrent requests · 7 failures total

Audit-log writes (concurrent)

100% pass across 3,632 hash-chained writes · 0 failures
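
The pass rates above follow directly from the request and failure counts. A short check, with `pass_rate` as a hypothetical helper name:

```python
def pass_rate(total: int, failures: int) -> float:
    """Percentage of requests that passed, rounded to two decimals."""
    return round(100 * (total - failures) / total, 2)


threat_analysis = pass_rate(2424, 7)  # 2,417 of 2,424 requests passed
audit_writes = pass_rate(3632, 0)     # all hash-chained writes passed
```

Here `threat_analysis` evaluates to 99.71 and `audit_writes` to 100.0, matching the published figures.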

Why G8KEPR Adds Less Overhead Than You Expect

API gateway overhead is mostly network RTT, not processing. Self-hosting next to your services removes most of it.

Network latency to gateway
Cloud-only gateway: adds 20–80ms RTT because the gateway is in another datacenter
G8KEPR self-hosted: near-zero, deployed alongside your services

Authentication overhead
Cloud-only gateway: remote auth call on every request, adds 50–200ms
G8KEPR self-hosted: in-process JWT check + local Redis lookup, 1–10ms

Rate limit check
Cloud-only gateway: centralized rate-limit service, extra network hop
G8KEPR self-hosted: local Redis, 2–5ms single round-trip

Threat analysis blocking
Cloud-only gateway: synchronous scan blocks the response until complete
G8KEPR self-hosted: async, response sent before analysis finishes

Cold start / scale-out
Cloud-only gateway: vendor cold starts add unpredictable spikes
G8KEPR self-hosted: persistent uvicorn workers, no cold starts

Vendor lock-in on hardware
Cloud-only gateway: cannot scale beyond vendor-provided instance types
G8KEPR self-hosted: scale to your hardware, upgrade without migration

11 Caches · TTL-Based · Always Fail-Safe

How The Latency Is Achieved

Eleven distinct Redis caches with explicit TTLs and explicit fallbacks. Cache miss is always safe — Redis provides performance, not correctness. Every cache has a documented degraded path.

Cache · TTL
Threat detection results · 300s
MCP tool permissions · 300s
Agent rate limits · 300s
AI Gateway provider health · 30s
LLM prompt response cache · 1 hour
Routing decisions · 5 min
Token blacklist · Until expiry
OAuth state · 15 min
2FA attempts · Per window
Org analytics summary · 1 hour
Policy validation · 60s
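
"Cache miss is always safe" implies a read path that treats any cache failure as a miss and falls back to the source of truth. A minimal sketch under that assumption follows; `TTLCache` is an in-memory stand-in for one Redis cache, and `cached_or_compute` is a hypothetical helper, not the project's API.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory stand-in for one Redis cache with a fixed TTL."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock
        self._data: dict[str, tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (value, self._clock() + self.ttl_s)


def cached_or_compute(cache: Any, key: str, compute: Callable[[], Any]) -> Any:
    """Use the cache when it works; otherwise compute, so correctness never depends on it."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except Exception:
        pass  # degraded path: treat a cache outage as a miss
    value = compute()
    try:
        cache.set(key, value)
    except Exception:
        pass  # failing to populate the cache costs only performance
    return value
```

If the cache backend is down, every call simply pays the full compute cost, which is the degraded-but-correct behavior the table's TTLs assume.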
ETag two-tier caching

Public paths: max-age=60, stale-while-revalidate=300 · Private paths: max-age=30, must-revalidate · 304 Not Modified on hit
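
The conditional-request flow behind those headers can be sketched in a few lines. The hashing scheme and the `respond` helper are assumptions made for illustration; only the Cache-Control values are taken from the line above.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    # Strong ETag derived from the response body (hypothetical scheme).
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def cache_control(is_public: bool) -> str:
    if is_public:
        return "max-age=60, stale-while-revalidate=300"
    return "max-age=30, must-revalidate"


def respond(body: bytes, if_none_match: Optional[str],
            is_public: bool) -> tuple[int, dict, bytes]:
    """Return (status, headers, body); 304 with an empty body when the client copy is current."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": cache_control(is_public)}
    if if_none_match == etag:
        return 304, headers, b""
    return 200, headers, body
```

A client that replays the ETag from its first 200 response in If-None-Match receives a 304 with no body until the underlying content changes.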

Pooled Redis client

Singleton with max 50 connections · 5s connect timeout · TLS for remote hosts · background ping every 30s · Redis ops < 1ms typical

Open Methodology

Run It Yourself

All test scripts are in the repository. No black-box benchmarks — every number on this page is reproducible. Run against the demo API or your own self-hosted deployment.

tests/load/k6-ci.js

CI gate — smoke + 50 VU load + 200 VU stress

tests/load/k6-baseline.js

Authoritative baseline — 5 key endpoints at 50 VU

tests/load/k6-full.js

Full stress test — 0→500 VU ramp (staging only)

backend/tests/performance/benchmark_guard_performance.py

SemanticGuard cache performance (pytest)

Test Environment

Demo server: Single-region cloud · 2 vCPU · 4GB RAM
Test tool: k6 (Grafana) + pytest-benchmark
Backend: FastAPI + asyncpg + Redis 7 + PostgreSQL 15
Load scenarios: Smoke (1 VU) · Load (50 VU) · Stress (200 VU)
Baseline VU count: 50 VU sustained for 2 minutes
Frequency: Blocking CI gate on every pull request

Production note: Self-hosted on production hardware (4+ vCPU, 8GB RAM) will significantly outperform these demo numbers. Contact us for production benchmark guidance.

Open Benchmarks

Performance questions? Run the tests yourself.

All k6 scripts are in the repo. Run them against the demo API or your own deployment. If you want a production sizing conversation, we are happy to help.