We publish our performance targets, test methodology, and CI gate thresholds. Every pull request must pass p95 < 200ms at 50 VU before it can merge. Here is exactly what G8KEPR adds to your request path — and why it stays flat as load scales.
Context matters. The CI gate runs at 50 VU on the demo stack (single-region cloud · 2 vCPU · 4GB RAM). All test scripts are in the repo — run them yourself.
Real measurements, not aspirational targets: every number below ties to a checked-in CI gate or a documented test run.
These thresholds are defined in tests/load/k6-ci.js and block merges if any threshold fails.
Sanity check — single-user baseline, tightest thresholds
Authoritative baseline — must pass to merge to main
Burst capacity — proves the system holds under spike load
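As a rough illustration of what a "p95 < 200ms" gate checks, the sketch below computes a nearest-rank percentile over latency samples and compares it to a limit. The function names are hypothetical, and k6 computes its own percentiles inside tests/load/k6-ci.js, so treat this as an explanation of the idea rather than the gate itself.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def gate_passes(samples_ms, p95_limit_ms=200.0):
    """Return True when the p95 latency is strictly below the limit."""
    return percentile(samples_ms, 95) < p95_limit_ms
```

With 100 samples, the gate tolerates up to five slow outliers: the 95th-ranked sample is the one that must stay under the limit.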
Published targets from tests/load/benchmark_results.json
GET /health · Readiness probe
POST /api/keys/validate · API key authentication
GET /api/gateway/proxy · Gateway passthrough
GET /api/compliance/frameworks · Compliance data fetch
GET /api/audit-logs · Audit log retrieval
POST /api/threat-intelligence/analyze · Full AI threat scan
Threat detection p95 is higher because it runs optional ML analysis. It is async and does not block the gateway response.
The detection engine is a four-tier cascade. Cheap tiers run first; expensive tiers escalate only when confidence is low. The fast path stays under 25ms on every request.
~20 pre-compiled signatures. SQLi, XSS, path traversal, command chaining. High-confidence (>0.9) hits block immediately.
SentenceTransformer cosine similarity vs. known injection vectors. Catches semantic equivalents that bypass literal regex. Lazy-loaded model.
Intent classification gate. Plus base64 / URL / hex decode-then-rescan, NFKD normalization, character-split detection, Shannon entropy.
High-uncertainty inputs only. Routed via the AI Gateway to an LLM provider. Async — fail-open on timeout to avoid blocking response.
< 25ms regex-only · < 40ms with ML scoring
100% precision · 87.96% recall · 0 false positives on 241-sample corpus
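The cheap end of the cascade can be sketched as follows. This is a simplified illustration, not the shipped engine: the four signatures are hypothetical stand-ins for the ~20 real ones, the entropy threshold of 4.5 bits is an arbitrary assumption, and the embedding and LLM tiers are reduced to an "escalate" verdict. It folds the decode-then-rescan and NFKD normalization steps into one pass for brevity.

```python
import base64
import binascii
import math
import re
import unicodedata
from collections import Counter
from urllib.parse import unquote

# Hypothetical signatures for illustration only.
SIGNATURES = [
    re.compile(r"(?i)\bunion\s+select\b"),     # SQL injection
    re.compile(r"(?i)<script\b"),              # XSS
    re.compile(r"\.\./"),                      # path traversal
    re.compile(r"[;&|]\s*(?:cat|rm|curl)\b"),  # command chaining
]


def shannon_entropy(text):
    """Bits per character; high values hint at encoded or obfuscated input."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in Counter(text).values())


def candidate_forms(text):
    """Yield the raw input plus normalized and decoded variants to rescan."""
    yield text
    yield unicodedata.normalize("NFKD", text)
    yield unquote(text)
    try:
        yield base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # not valid base64 text; nothing extra to scan


def tier1_verdict(text, entropy_threshold=4.5):
    """Return 'block' on a signature hit, 'escalate' on suspicious entropy, else 'allow'."""
    for form in candidate_forms(text):
        if any(sig.search(form) for sig in SIGNATURES):
            return "block"
    return "escalate" if shannon_entropy(text) > entropy_threshold else "allow"
```

For example, `tier1_verdict("%3Cscript%3Ealert(1)")` returns `"block"` because the URL-decoded form matches the XSS signature even though the raw string does not.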
Every request goes through only the steps it needs. Pass-through validation adds ~15–30ms. Active threat scanning adds 50–200ms but runs async.
In-memory JWT signature check + Redis GET for key scope
Redis INCR with sliding window — single round-trip
In-memory state check — no I/O
Redis semantic cache hit — no model call needed
Pattern matching + optional ML scoring. Async — does not block response.
Async PostgreSQL insert — does not add to request latency
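One common way to get sliding-window behaviour from INCR-style counters is to keep one counter per fixed window and weight the previous window by how much of it still overlaps. The sketch below uses a plain dict as a stand-in for Redis keys; whether G8KEPR uses this exact scheme is an assumption, and the class and parameter names are hypothetical.

```python
import time


class SlidingWindowCounter:
    """Sliding-window rate limit approximated from two fixed-window counters."""

    def __init__(self, limit, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.counts = {}  # stand-in for Redis keys: (client, window_index) -> count

    def allow(self, client):
        now = self.clock()
        index = int(now // self.window_s)
        elapsed = (now % self.window_s) / self.window_s
        # Equivalent of INCR on the current window's key (rejected requests count too).
        current = self.counts.get((client, index), 0) + 1
        self.counts[(client, index)] = current
        previous = self.counts.get((client, index - 1), 0)
        # Weight the previous window by the fraction still inside the sliding window.
        estimate = previous * (1.0 - elapsed) + current
        return estimate <= self.limit
```

In Redis, the increment and the read of the previous window's key could be sent in one pipeline, which is what keeps the check to a single round-trip; old keys would expire via a TTL rather than accumulate as they do in this dict.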
Pass-through total: ~15–30ms added overhead for JWT validation + rate limit + circuit breaker. No more.
Threat analysis: Runs async after the response is sent. Your users do not wait for the ML model.
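The "respond first, scan afterwards" pattern can be sketched with asyncio. The handler schedules the scan as a background task and returns without awaiting it. `scan_for_threats` and `handle_request` are hypothetical names, and the sleep stands in for ML scoring.

```python
import asyncio

background_tasks = set()  # hold references so pending tasks are not garbage-collected


async def scan_for_threats(payload, findings):
    """Hypothetical stand-in for the async threat scan."""
    await asyncio.sleep(0.05)  # simulate slow ML scoring
    findings.append(("scanned", payload))


async def handle_request(payload, findings):
    # Schedule the scan without awaiting it, so the response is not delayed.
    task = asyncio.create_task(scan_for_threats(payload, findings))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"status": 200, "body": "ok"}


async def main():
    findings = []
    response = await handle_request("GET /api/gateway/proxy", findings)
    scanned_before = len(findings)  # the scan has not finished at this point
    await asyncio.gather(*background_tasks)
    return response, scanned_before, len(findings)
```

Running `asyncio.run(main())` shows the response is available before the scan has recorded anything; the finding appears only after the background task completes.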
Performance is a first-class design constraint, not an afterthought. These are the specific choices that give G8KEPR its overhead profile.
All endpoints are async/await, so no thread blocks on I/O. A single worker can handle hundreds of concurrent connections without spawning threads.
Identical concurrent requests to Redis are coalesced — only one cache miss fires even under burst. Semantic cache prevents re-running ML models for repeated prompt patterns.
PostgreSQL connections are pooled via asyncpg — no per-request connect overhead. Pool sizing is tuned per deployment based on DB capacity.
API responses are Brotli-compressed at the edge, reducing wire bytes by ~30% vs gzip. HTTP/3 eliminates head-of-line blocking for clients that support it.
These aren't tunables that need configuring after deployment — they're structural decisions that keep latency flat as load scales. None of them are present in a generic API gateway plus third-party AI guardrail stack.
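The request coalescing mentioned above is often implemented as a "single-flight" map of in-flight tasks. The following is a minimal sketch under that assumption; the class name and the `loader` callback are hypothetical and not taken from the G8KEPR codebase.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls for the same key into one in-flight load."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Task currently loading that key

    async def do(self, key, loader):
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the load; later callers reuse it.
            task = asyncio.create_task(loader())
            self._inflight[key] = task
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared load.
        return await asyncio.shield(task)
```

If twenty coroutines call `do("k", loader)` at the same moment, `loader` runs once and all twenty receive its result, which is the behaviour that keeps a burst from turning into twenty cache misses.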
Cheap tiers run first. 99% of traffic terminates at Tier 1 regex (<1ms). Only ambiguous inputs escalate to embeddings (5-15ms), NLI (<15ms), or LLM (rare). p99 stays <25ms regex-only, <40ms with ML.
modules/threat_detection/
SHA-256 fingerprint per system prompt. After 2 identical prompts, auto-injects cache_control:ephemeral. Break-even at 2 calls (38% off). 88% off at 10 calls. Zero SDK changes.
gateway/cache_optimizer.py
Statistical baseline per provider per hour-of-day, not static thresholds. 4 rolling windows (1m / 5m / 15m / 1hr). Progressive recovery 10/25/50/100%. More resilient than Hystrix or Resilience4j.
gateway/router.py
Identical concurrent requests are coalesced: only one cache miss fires under burst. Semantic cache prevents re-running ML for repeated patterns. Cache miss is always safe; every cache has a documented fallback path.
11 Redis caches · TTL fail-safe
TrafficAnalyzer caches are size-capped (5K endpoints, 10K patterns) with auto-eviction on overflow. No memory leak vector: deliberately built to fail predictably under sustained load instead of OOM-spiraling.
_BoundedDict · OrderedDict
Below are real numbers from the documented test run, not estimates. Tests run in CI every night against the demo stack.
GET /health · Readiness probe
POST /auth/login · Legitimate request
POST /auth/login (SQLi) · Threat detected + blocked at 403
Threat detection adds zero measurable latency vs. legitimate requests at single-user load (compare row 2 to row 3).
p50 130ms · p95 350ms · p99 1.3s
p95 19ms · p99 47ms
877 / 877 blocked as 403
99.71% pass rate across 2,424 concurrent requests · 7 failures total
100% pass across 3,632 hash-chained writes · 0 failures
API gateway overhead is mostly network RTT, not processing. Self-hosted eliminates that entirely.
Eleven distinct Redis caches with explicit TTLs and explicit fallbacks. Cache miss is always safe — Redis provides performance, not correctness. Every cache has a documented degraded path.
TTL · Fallback
300s · In-memory LRU
300s · In-memory dict
300s · DB query
30s · In-memory LRU
1 hour · LLM API call
5 min · Routing algorithm
Until expiry · Fail-closed (reject)
15 min · Flow fails
Per window · Degraded (warn)
1 hour · DB query
60s · DB query
Public paths: max-age=60, stale-while-revalidate=300 · Private paths: max-age=30, must-revalidate · 304 Not Modified on hit
Singleton with max 50 connections · 5s connect timeout · TLS for remote hosts · background ping every 30s · Redis ops < 1ms typical
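The "cache miss is always safe" rule can be illustrated with a small wrapper that treats both a miss and a cache failure as a call to the fallback. This is a sketch under stated assumptions: `FailSafeCache` is a hypothetical name, and `store` is any dict-like object standing in for a Redis client.

```python
import time


class FailSafeCache:
    """TTL cache wrapper that degrades to a fallback on any miss or cache error."""

    def __init__(self, store, ttl_s, clock=time.monotonic):
        self.store = store  # dict-like stand-in for a Redis client
        self.ttl_s = ttl_s
        self.clock = clock

    def get(self, key, fallback):
        try:
            entry = self.store.get(key)
            if entry is not None and entry[1] > self.clock():
                return entry[0]  # fresh hit
        except Exception:
            # Cache unavailable: serve from the fallback instead of failing the request.
            return fallback(key)
        value = fallback(key)
        try:
            self.store[key] = (value, self.clock() + self.ttl_s)
        except Exception:
            pass  # a failed write only costs performance, not correctness
        return value
```

A cache listed above as fail-closed would raise or reject at the point where this sketch calls the fallback.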
All test scripts are in the repository. No black-box benchmarks — every number on this page is reproducible. Run against the demo API or your own self-hosted deployment.
tests/load/k6-ci.js · CI gate: smoke + 50 VU load + 200 VU stress
tests/load/k6-baseline.js · Authoritative baseline: 5 key endpoints at 50 VU
tests/load/k6-full.js · Full stress test: 0→500 VU ramp (staging only)
backend/tests/performance/benchmark_guard_performance.py · SemanticGuard cache performance (pytest)
Production note: Self-hosted on production hardware (4+ vCPU, 8GB RAM) will significantly outperform these demo numbers. Contact us for production benchmark guidance.
All k6 scripts are in the repo. Run them against the demo API or your own deployment. If you want a production sizing conversation, we are happy to help.