NVIDIA CAGRA
cuVS GPU ANN
NIM-Compatible
LIVE
TTFT Performance — Time to First Token

Path                   Description                      P50 (ms)   Speedup
Cold Prefill           No cache — full computation      --         1.0x baseline
LMCache Exact Match    Exact 256-token chunk match      --         —
SemBlend Semantic      CAGRA donor lookup + KV reuse    --         —
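The speedup reported alongside these P50 latencies is a simple ratio against the cold-prefill baseline. A minimal sketch, using hypothetical latencies (the dashboard fills in live values):

```python
def ttft_speedup(cold_p50_ms, path_p50_ms):
    """Speedup shown on the cards: cold-prefill P50 over the path's P50."""
    return cold_p50_ms / path_p50_ms

# Hypothetical values for illustration: 1200 ms cold vs 160 ms semantic.
speedup = ttft_speedup(1200.0, 160.0)  # 7.5x
```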
TTFT Comparison
Fleet Architecture — Additive to Dynamo's KV-Aware Routing

Additive to Dynamo — 3-Path Routing

1. Dynamo RadixTree exact prefix (~0.1ms): Dynamo native, unchanged. Exact token match → route to prefix worker.
2. SemBlend semantic search (embed ~3ms + CAGRA <1ms): runs only on a Dynamo miss. cuVS CAGRA: brute-force at N<64, graph search at N≥64.
3. Route to the donor-holding worker (sim ≥ 0.50): the worker reuses the donor's KV cache, skipping up to 74% of prefill.
F. Cold fallback — Dynamo native (round-robin): both paths miss → least-loaded worker. SemBlend adds zero overhead here.
Strictly additive: SemBlend never replaces Dynamo's routing. It only catches requests that Dynamo would send to cold fallback — turning misses into semantic hits with zero degradation to existing paths.
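The four paths above can be sketched as a single routing function. Everything here is illustrative: the names, data shapes, and the plain-Python cosine similarity are stand-ins, not the actual Dynamo or SemBlend APIs.

```python
# Hypothetical sketch of the 3-path (+fallback) routing decision.
import math

SIM_THRESHOLD = 0.50  # donor similarity cutoff from the diagram

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(tokens, embedding, radix, donors, loads):
    # Path 1: Dynamo-native exact-prefix match (RadixTree, ~0.1 ms).
    for prefix, worker in radix.items():
        if tuple(tokens[: len(prefix)]) == prefix:
            return worker, "exact-prefix"
    # Path 2: semantic search, only on a Dynamo miss
    # (brute-force scan here; CAGRA graph search at scale).
    best_worker, best_sim = None, -1.0
    for donor_emb, worker in donors:
        sim = cosine(embedding, donor_emb)
        if sim > best_sim:
            best_worker, best_sim = worker, sim
    # Path 3: reuse the donor's KV cache if similarity clears the bar.
    if best_sim >= SIM_THRESHOLD:
        return best_worker, "semantic"
    # Path F: cold fallback — least-loaded worker, Dynamo-native.
    return min(loads, key=loads.get), "cold"
```

Note that paths 1 and F are untouched Dynamo behavior; SemBlend only inserts the middle two steps.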
Worker Fleet — NVIDIA A10G GPUs
CAGRA Scaling — GPU ANN vs CPU Brute-Force
Search Latency vs Fleet Donor Pool Size (100 queries, 384-dim MiniLM embeddings)
Key insight: CAGRA search stays near-constant (~1.3ms) as the donor pool grows from 100 to 20K donors,
while NumPy CPU brute-force grows linearly. At production scale, with 100K+ donors across the fleet,
CAGRA provides <2ms routing decisions — critical for real-time fleet-level semantic KV routing.
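The CPU baseline in this comparison is a straightforward exact search. A minimal NumPy sketch (the function name and toy pool are illustrative; the CAGRA graph path is a cuVS GPU call and is not reproduced here):

```python
import numpy as np

def brute_force_topk(queries, donors, k=1):
    """Exact O(N*d) cosine search over the donor pool: the CPU baseline
    whose latency grows linearly with N, unlike CAGRA's near-constant
    graph search. Inputs are L2-normalized (Q, d) and (N, d) arrays."""
    sims = queries @ donors.T                   # cosine via dot product
    idx = np.argsort(-sims, axis=1)[:, :k]      # top-k most similar donors
    return idx, np.take_along_axis(sims, idx, axis=1)

# Toy pool standing in for 384-dim MiniLM embeddings.
rng = np.random.default_rng(0)
donors = rng.normal(size=(1000, 384)).astype(np.float32)
donors /= np.linalg.norm(donors, axis=1, keepdims=True)
queries = donors[:3].copy()                     # known donors as queries
idx, sims = brute_force_topk(queries, donors)   # each matches itself
```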
Additive to Dynamo: SemBlend extends Dynamo's existing RadixTree router — it only activates when
exact-prefix matching misses. Semantic hit rate is workload-dependent: highly repetitive workloads (RAG, customer support)
see 30-50% additional hits; diverse workloads see 10-20%. Dynamo's native paths are never replaced or degraded.
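Because SemBlend only sees requests that miss the RadixTree, its additional hits compose with the exact-prefix rate rather than replacing it. A quick arithmetic sketch with hypothetical rates:

```python
def combined_hit_rate(exact_rate, semantic_rate_on_miss):
    """Overall hit fraction when semantic matching applies only to the
    (1 - exact_rate) share of requests that miss the RadixTree."""
    return exact_rate + (1.0 - exact_rate) * semantic_rate_on_miss

# Hypothetical illustration: 40% exact-prefix hits, plus semantic
# matching catching 30% of the remaining misses.
rate = combined_hit_rate(0.40, 0.30)  # 0.40 + 0.60 * 0.30 = 0.58
```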
SemBlend
Semantic KV-cache reuse.
Finds semantically similar past requests (donors) and reuses their computed KV cache.
5.5-12x TTFT speedup.
NVIDIA CAGRA
GPU-accelerated ANN index (cuVS).
Near-constant search latency as the pool scales. Auto-adaptive: brute-force@N<64, CAGRA graph@N≥64.
<2ms at 100K+ donors.
NIM-Compatible
OpenAI-compatible API. Drop-in replacement for NIM inference.
vLLM v0.14.1 + LMCache + SemBlend connector.
Zero API changes needed.
Fleet Routing
Additive to Dynamo's KV-aware router. Catches misses that exact-prefix matching can't.
NATS JetStream event plane. MiniLM embedding. Per-tenant affinity.
10-50% additional hit rate (workload-dependent).
Ready. Click "Run Fleet Benchmark" to start, or "Refresh Stats" to see live fleet state.