NVIDIA CAGRA
cuVS GPU ANN
NIM-Compatible
LIVE
TTFT Performance — Time to First Token

Path                   Description                      P50 (ms)   Speedup
Cold Prefill           No cache — full computation      --         1.0x baseline
LMCache Exact Match    Exact 256-token chunk match      --         —
SemBlend Semantic      CAGRA donor lookup + KV reuse    --         —
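The speedup reported alongside these P50 latencies is a simple ratio against the cold-prefill baseline. A minimal sketch, using hypothetical latencies (the dashboard fills in live values):

```python
def ttft_speedup(cold_p50_ms, path_p50_ms):
    """Speedup shown on the cards: cold-prefill P50 over the path's P50."""
    return cold_p50_ms / path_p50_ms

# Hypothetical values for illustration: 1200 ms cold vs 160 ms semantic.
speedup = ttft_speedup(1200.0, 160.0)  # 7.5x
```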
TTFT Comparison
Fleet Architecture — Additive to Dynamo's KV-Aware Routing

Additive to Dynamo — 3-Path Routing

1. Dynamo RadixTree exact prefix (~0.1ms): Dynamo native, unchanged. Exact token match → route to prefix worker.
2. SemBlend semantic search (embed ~3ms + CAGRA <1ms): runs only on a Dynamo miss. cuVS CAGRA: brute-force at N<64, graph search at N≥64.
3. Route to the donor-holding worker (sim ≥ 0.50): the worker reuses the donor's KV cache, skipping up to 74% of prefill.
F. Cold fallback — Dynamo native (round-robin): both paths miss → least-loaded worker. SemBlend adds zero overhead here.
Strictly additive: SemBlend never replaces Dynamo's routing. It only catches requests that Dynamo would send to cold fallback — turning misses into semantic hits with zero degradation to existing paths.
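The four paths above can be sketched as a single routing function. Everything here is illustrative: the names, data shapes, and the plain-Python cosine similarity are stand-ins, not the actual Dynamo or SemBlend APIs.

```python
# Hypothetical sketch of the 3-path (+fallback) routing decision.
import math

SIM_THRESHOLD = 0.50  # donor similarity cutoff from the diagram

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(tokens, embedding, radix, donors, loads):
    # Path 1: Dynamo-native exact-prefix match (RadixTree, ~0.1 ms).
    for prefix, worker in radix.items():
        if tuple(tokens[: len(prefix)]) == prefix:
            return worker, "exact-prefix"
    # Path 2: semantic search, only on a Dynamo miss
    # (brute-force scan here; CAGRA graph search at scale).
    best_worker, best_sim = None, -1.0
    for donor_emb, worker in donors:
        sim = cosine(embedding, donor_emb)
        if sim > best_sim:
            best_worker, best_sim = worker, sim
    # Path 3: reuse the donor's KV cache if similarity clears the bar.
    if best_sim >= SIM_THRESHOLD:
        return best_worker, "semantic"
    # Path F: cold fallback — least-loaded worker, Dynamo-native.
    return min(loads, key=loads.get), "cold"
```

Note that paths 1 and F are untouched Dynamo behavior; SemBlend only inserts the middle two steps.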
Worker Fleet — NVIDIA A10G GPUs
CAGRA Scaling — GPU ANN vs CPU Brute-Force
Search Latency vs Fleet Donor Pool Size (100 queries, 384-dim MiniLM embeddings)
Key insight: CAGRA search stays near-constant (~1.3ms) as the donor pool grows from 100 to 20K donors,
while NumPy CPU brute-force grows linearly. At production scale, with 100K+ donors across the fleet,
CAGRA provides <2ms routing decisions — critical for real-time fleet-level semantic KV routing.
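The CPU baseline in this comparison is a straightforward exact search. A minimal NumPy sketch (the function name and toy pool are illustrative; the CAGRA graph path is a cuVS GPU call and is not reproduced here):

```python
import numpy as np

def brute_force_topk(queries, donors, k=1):
    """Exact O(N*d) cosine search over the donor pool: the CPU baseline
    whose latency grows linearly with N, unlike CAGRA's near-constant
    graph search. Inputs are L2-normalized (Q, d) and (N, d) arrays."""
    sims = queries @ donors.T                   # cosine via dot product
    idx = np.argsort(-sims, axis=1)[:, :k]      # top-k most similar donors
    return idx, np.take_along_axis(sims, idx, axis=1)

# Toy pool standing in for 384-dim MiniLM embeddings.
rng = np.random.default_rng(0)
donors = rng.normal(size=(1000, 384)).astype(np.float32)
donors /= np.linalg.norm(donors, axis=1, keepdims=True)
queries = donors[:3].copy()                     # known donors as queries
idx, sims = brute_force_topk(queries, donors)   # each matches itself
```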
Additive to Dynamo: SemBlend extends Dynamo's existing RadixTree router — it only activates when
exact-prefix matching misses. Semantic hit rate is workload-dependent: highly repetitive workloads (RAG, customer support)
see 30-50% additional hits; diverse workloads see 10-20%. Dynamo's native paths are never replaced or degraded.
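Because SemBlend only sees requests that miss the RadixTree, its additional hits compose with the exact-prefix rate rather than replacing it. A quick arithmetic sketch with hypothetical rates:

```python
def combined_hit_rate(exact_rate, semantic_rate_on_miss):
    """Overall hit fraction when semantic matching applies only to the
    (1 - exact_rate) share of requests that miss the RadixTree."""
    return exact_rate + (1.0 - exact_rate) * semantic_rate_on_miss

# Hypothetical illustration: 40% exact-prefix hits, plus semantic
# matching catching 30% of the remaining misses.
rate = combined_hit_rate(0.40, 0.30)  # 0.40 + 0.60 * 0.30 = 0.58
```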
SemBlend
Semantic KV-cache reuse.
Finds semantically similar past requests (donors) and reuses their computed KV cache.
5.5-12x TTFT speedup.
NVIDIA CAGRA
GPU-accelerated ANN index (cuVS).
Near-constant search latency as the pool scales. Auto-adaptive: brute-force@N<64, CAGRA graph@N≥64.
<2ms at 100K+ donors.
NIM-Compatible
OpenAI-compatible API. Drop-in replacement for NIM inference.
vLLM v0.14.1 + LMCache + SemBlend connector.
Zero API changes needed.
Fleet Routing
Additive to Dynamo's KV-aware router. Catches misses that exact-prefix matching can't.
NATS JetStream event plane. MiniLM embedding. Per-tenant affinity.
10-50% additional hit rate (workload-dependent).
Ready. Click "Run Fleet Benchmark" to start, or "Refresh Stats" to see live fleet state.