Standard API caching uses exact key matching: the same URL with the same parameters returns the same cached response. For AI APIs, this is too restrictive — two prompts that phrase the same question differently are semantically equivalent but not lexically identical. Semantic caching bridges this gap by using embedding similarity to match semantically equivalent prompts to cached responses.
How It Works
When a request arrives, compute an embedding of the prompt and search a vector index for similar cached prompts. If the closest cached prompt's similarity exceeds a threshold (typically 0.92-0.95 cosine similarity), return the cached response without invoking the model. Otherwise, call the model, cache the response alongside its embedding, and return it.
async def semantic_cache_lookup(prompt: str, threshold: float = 0.93) -> str | None:
    # Embed the incoming prompt and look up the nearest cached prompt.
    embedding = await embed(prompt)
    results = await vector_db.query(
        vector=embedding,
        top_k=1,
        include_metadata=True,
    )
    # Serve from cache only when the best match clears the similarity threshold.
    if results and results[0].score >= threshold:
        return results[0].metadata["cached_response"]
    return None  # Cache miss: call the model

The Security Implications
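For illustration, the full read-through flow, including the write path on a miss, can be sketched entirely in memory. Everything here is a toy stand-in: `embed()` is a bag-of-characters placeholder for a real embedding model, and `cache` is a plain list rather than a vector index.

```python
import math

def embed(prompt: str) -> list[float]:
    # Toy bag-of-characters embedding; a real system would call an embedding API.
    vec = [0.0] * 26
    for ch in prompt.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

cache: list[tuple[list[float], str]] = []  # (embedding, cached_response) pairs

def cached_completion(prompt: str, call_model, threshold: float = 0.93) -> str:
    emb = embed(prompt)
    if cache:
        best = max(cache, key=lambda entry: cosine(emb, entry[0]))
        if cosine(emb, best[0]) >= threshold:
            return best[1]          # semantic cache hit
    response = call_model(prompt)   # cache miss: invoke the model
    cache.append((emb, response))   # write path: store embedding + response
    return response
```

An exact repeat of a prompt always scores 1.0 against its own entry, so the second identical request is served from cache without a model call.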
Semantic caching introduces a new attack surface: cache poisoning via semantic similarity. An attacker who can get a response into the cache can craft its prompt to sit close, in embedding space, to legitimate queries; later legitimate requests then match the poisoned entry and receive the attacker's response. Compounding this, a response cached in one session may be served to a different user in a different session.
Mitigations: scope the cache per user or per organization (never return responses cached from another organization), apply output validation to cached responses before returning them (a cached response that now fails validation should be evicted), and set cache TTLs that bound how long a poisoned entry can be served.
Where Semantic Caching Breaks Down
Semantic caching works well for stable knowledge queries (FAQ responses, documentation lookups, product information). It breaks down for: queries where recency matters (news, current events), personalized responses that depend on user context, and queries where small semantic differences have large output differences (calculations, code generation).
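One way to act on these failure modes is a gate that decides whether a request should consult the semantic cache at all. The keyword heuristics below are illustrative placeholders; a production system might use a lightweight classifier instead.

```python
# Markers are rough proxies for the failure modes above (hypothetical lists).
RECENCY_MARKERS = ("today", "latest", "current", "news", "now")
PRECISION_MARKERS = ("calculate", "compute", "sum", "convert", "code")

def is_cacheable(prompt: str, has_user_context: bool) -> bool:
    text = prompt.lower()
    if has_user_context:
        return False  # personalized responses must not be shared via cache
    if any(m in text for m in RECENCY_MARKERS):
        return False  # recency-sensitive: a cached answer may be stale
    if any(m in text for m in PRECISION_MARKERS):
        return False  # small wording changes can imply different exact outputs
    return True
```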
Start with a conservative similarity threshold (0.95+) and lower it gradually as you build confidence in your cache quality. A false positive cache hit — returning the wrong answer — is worse than a cache miss.
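One way to lower the threshold safely is shadow mode: serve only hits above the conservative threshold, but record how looser candidate thresholds would have behaved so those would-be hits can be audited offline first. A minimal sketch (the names are illustrative):

```python
from collections import Counter

# Counts would-be cache hits per candidate threshold, for offline review.
shadow_hits: Counter = Counter()

def serve_from_cache(score: float, serve_threshold: float = 0.95,
                     candidates: tuple[float, ...] = (0.93, 0.90)) -> bool:
    for t in candidates:
        if score >= t:
            shadow_hits[t] += 1  # would have hit at this looser threshold
    # Only the conservative threshold actually decides what is served.
    return score >= serve_threshold
```

Auditing the responses behind `shadow_hits` at 0.93 tells you what the false-positive rate would be before any user ever sees a wrong cached answer.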
