The Problem
Our agent answers questions about product documentation. Users ask the same questions differently:
- "How do I reset my password?"
- "What's the password reset process?"
- "I forgot my password, how do I change it?"
All three hit the LLM. Same answer, three separate API calls. Three separate charges.
Exact-match caching doesn't help because the queries aren't identical. We needed something smarter.
Enter Semantic Caching
Instead of matching exact strings, semantic caching matches meaning.
How it works:
- Generate embedding for incoming query
- Search for similar queries in cache (cosine similarity)
- If similarity > threshold (e.g., 0.95), return cached response
- Otherwise, call LLM and cache the result
Example:
```
Query 1: "How do I reset my password?"
→ No cache hit
→ Call LLM ($0.002)
→ Cache: embedding + response

Query 2: "What's the password reset process?"
→ Similarity: 0.97 (cache hit!)
→ Return cached response ($0.000)
→ Saved: $0.002, 800ms latency

Query 3: "I forgot my password, help?"
→ Similarity: 0.96 (cache hit!)
→ Return cached response ($0.000)
→ Saved: $0.002, 800ms latency
```
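In code, the flow above looks roughly like this. This is a minimal sketch, not Bifrost's implementation: `embed()` (e.g., a wrapper around text-embedding-3-small) and `call_llm()` are assumed helpers, and the cache lives in memory rather than a vector store.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95
cache = []  # list of (embedding, response) pairs; a real deployment would use a vector store

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str) -> str:
    query_embedding = embed(query)  # assumed helper: returns the query embedding as a 1-D vector

    # 1. Search the cache for the most similar previously seen query
    best_sim, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        sim = cosine_similarity(query_embedding, cached_embedding)
        if sim > best_sim:
            best_sim, best_response = sim, cached_response

    # 2. Cache hit: similarity clears the threshold, so skip the LLM call
    if best_response is not None and best_sim >= SIMILARITY_THRESHOLD:
        return best_response

    # 3. Cache miss: call the LLM and store embedding + response for next time
    response = call_llm(query)  # assumed helper: returns the model's answer as a string
    cache.append((query_embedding, response))
    return response
```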
Real Results
Our production numbers (30 days):
```
Total requests: 45,000
Cache hits: 18,000 (40%)
Cost saved: $76
Latency saved: ~14,400 seconds

Average response time:
├─ Cache miss: 1.2s
└─ Cache hit: 0.05s (24x faster)
```
A 40% cache hit rate might not sound impressive, but that's 40% of requests answered instantly and for free.
When Semantic Caching Works
Great for:
- Documentation Q&A (same questions, different wording)
- Customer support (common issues asked repeatedly)
- Code explanation (similar code patterns)
- Translation tasks (same phrases)
Not great for:
- Queries requiring current data ("today's weather")
- Personalized responses (user-specific context)
- Creative generation (want variety, not cached outputs)
- Low-traffic endpoints (not enough queries to benefit)
Implementation in Bifrost
Bifrost has semantic caching built in. Here’s how to enable it.
1. Configure caching
```json
{
  "cache": {
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "embedding_model": "text-embedding-3-small"
  }
}
```
2. That’s it
The gateway automatically:
- Generates embeddings for incoming queries
- Searches the cache for semantically similar queries
- Returns cached responses when similarity exceeds the threshold
- Updates the cache with new responses on cache misses
Dashboard visibility
The Bifrost dashboard shows:
- Cache hit rate
- Cost savings
- Latency improvements
The Tradeoffs
Pros
- Massive cost savings (~40% in our case)
- Much faster responses (~24× faster in our case)
- Zero application code changes required
Cons
- Embedding generation adds ~50 ms latency on cache misses
- Cache storage costs (minimal — embeddings are small)
- Potential for stale responses if underlying data changes frequently
Cost Breakdown
Cost calculation:

```
├─ Embedding: $0.00002 per query
├─ LLM call:  $0.00200 per query
└─ Savings per cache hit: $0.00198

At a 40% cache hit rate (18,000 hits):
18,000 × $0.00198 = $35.64 saved

Embedding cost for the 27,000 cache misses
(the hits' embedding cost is already netted out in the $0.00198 figure):
27,000 × $0.00002 = $0.54

Net savings:
$35.64 − $0.54 = $35.10 per 45k queries
```
Even accounting for embedding costs, the savings are substantial.
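The same arithmetic generalizes if you want to plug in your own traffic and pricing. A small illustrative helper; the defaults are the per-query figures above and will differ for your models:

```python
def net_savings(total_requests: int, hit_rate: float,
                llm_cost_per_call: float = 0.002,
                embedding_cost_per_query: float = 0.00002) -> float:
    """Net dollars saved: LLM calls avoided on cache hits, minus embedding cost paid on every query."""
    hits = int(total_requests * hit_rate)
    gross_savings = hits * llm_cost_per_call                        # LLM calls avoided
    embedding_overhead = total_requests * embedding_cost_per_query  # every query is embedded
    return gross_savings - embedding_overhead

print(round(net_savings(45_000, 0.40), 2))  # 35.1 with the figures above
```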
Tuning the Similarity Threshold
Similarity threshold selection is critical:
0.90 → High hit rate, higher risk of incorrect cache hits
0.95 → Balanced (our recommended default)
0.98 → Safer, but lower hit rate
Our test results
0.90 → 52% hit rate, ~3% incorrect cache hits
0.95 → 40% hit rate, ~0.5% incorrect cache hits
0.98 → 28% hit rate, ~0.1% incorrect cache hits
Recommendation:
Start at 0.95, then tune based on your accuracy and freshness requirements.
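One way to measure this for your own traffic: take query pairs you have labeled as "should share an answer" or "should not", compute their similarities, and sweep the threshold. A rough sketch, again assuming an `embed()` helper and a hand-labeled `labeled_pairs` list:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep_thresholds(labeled_pairs, thresholds=(0.90, 0.95, 0.98)):
    """labeled_pairs: list of (query_a, query_b, same_answer) triples built from real traffic."""
    sims = [(cosine_similarity(embed(a), embed(b)), same)  # embed() assumed as before
            for a, b, same in labeled_pairs]
    for t in thresholds:
        merged = [same for sim, same in sims if sim >= t]      # pairs the cache would treat as one
        hit_rate = len(merged) / len(sims)
        incorrect = merged.count(False) / max(len(merged), 1)  # merged pairs that needed different answers
        print(f"threshold={t:.2f}  hit_rate={hit_rate:.0%}  incorrect_hits={incorrect:.1%}")
```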
Cache Invalidation
Time-based (TTL):
Set an expiration time (e.g., 3600 seconds = 1 hour). Good for data that changes predictably.

```json
{
  "ttl_seconds": 3600
}
```
Manual invalidation:
Clear cache when you update documentation or data sources.
```bash
curl -X POST http://localhost:8080/cache/clear
```
Selective clearing:
Tag cache entries by topic, clear specific topics when updated.
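Bifrost's clear endpoint is shown above; if you maintain your own cache layer, tag-based invalidation can look something like this sketch (illustrative only, not Bifrost's API):

```python
from collections import defaultdict

class TaggedSemanticCache:
    """Entries carry topic tags, so updating one doc only evicts the related answers."""

    def __init__(self):
        self.entries = {}                # entry_id -> (embedding, response)
        self.by_tag = defaultdict(set)   # tag -> ids of entries derived from that topic

    def put(self, entry_id, embedding, response, tags):
        self.entries[entry_id] = (embedding, response)
        for tag in tags:
            self.by_tag[tag].add(entry_id)

    def invalidate_tag(self, tag):
        # Call when the source docs for a topic change, e.g. invalidate_tag("password-reset")
        for entry_id in self.by_tag.pop(tag, set()):
            self.entries.pop(entry_id, None)
```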
Try It Yourself
Bifrost is open source and MIT licensed:
```bash
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
```
Enable semantic caching in the UI settings. Monitor the dashboard to see cache hit rates and cost savings in real time.
Full implementation details in the GitHub repo.
The Bottom Line
Semantic caching is the easiest optimization we've implemented:
- Zero code changes
- 40% cost reduction
- 24x faster responses on cache hits
If you're making repeated LLM calls with similar queries, semantic caching pays for itself immediately.
Built by the team at Maxim AI. We also build evaluation and observability tools for AI agents.