KV Cache Visualizer
Real-time inference simulation: prefill → decode → streaming → eviction
Mode
Inference Tick: 0
Phase: Prefill (parallel)
Eviction Policy
Evicts the oldest tokens when the cache is full (a sliding-window scheme, as in Mistral 7B).
Prompt
Speed: 1×
Prefill: 0/7 tokens
GPU: KV Cache Blocks
Paged KV Storage
Prefill: process all prompt tokens in one batched pass
Block 1: Capacity 0/8
Block 2: Capacity 0/8
Block 3: Capacity 0/8
Block 4: Capacity 0/8
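The paged layout above (4 blocks × 8 KV slots) can be sketched as a minimal block allocator. This is an illustrative simulation only, not a real inference engine: `PagedKVCache`, `BLOCK_SIZE`, and `NUM_BLOCKS` are assumed names chosen to mirror the visualizer.

```python
BLOCK_SIZE = 8   # KV slots per block, as shown above
NUM_BLOCKS = 4   # four blocks -> 32 slots total

class PagedKVCache:
    def __init__(self):
        # Each block holds the token ids whose K/V vectors it stores.
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def append(self, token_id):
        """Write one token's KV into the first block with free capacity."""
        for block in self.blocks:
            if len(block) < BLOCK_SIZE:
                block.append(token_id)
                return True
        return False  # cache full: an eviction policy must run first

    def used_slots(self):
        return sum(len(b) for b in self.blocks)

cache = PagedKVCache()
for t in range(7):          # prefill: write KV for 7 prompt tokens
    cache.append(t)
print(cache.used_slots())   # 7 of 32 slots used
```

Tokens fill the current block before a new one is claimed, which is why the visualizer's first block fills up before Block 2 is touched.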
Legend
New KV (written this step)
Reused (attention reads)
Pinned KV (locked in prefix)
Evicted KV (fading out)
Empty slot
gen = generated decode token (simulated, not real inference)
Prefill: Process all prompt tokens in one batched pass.
Decode: Read cached KV → generate next token → write new KV.
Policy Impact:
• Sliding Window: Evicts oldest on overflow
• Pinned Prefix: Locks first block (system prompt)
• Recent-N: Keeps only recent tokens
KV Cache Blocks
0/32 slots used
New: 0
Reused (attention reads): 0
Empty: 32
How Policies Map to Real LLM Serving
Inference Loop
- Inference clock: one tick = one forward pass through the model.
- Prefill phase: batch prompt tokens, allocate KV, fill initial blocks.
- Decode phase: read cached KV, generate next token, append KV to last block.
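The tick loop above can be sketched in a few lines. Names here are illustrative (there is no real model call): a tick either prefills all prompt KV at once or performs one decode step that reads every cached entry and appends one new one.

```python
# One tick = one forward pass (simulated).

def prefill(prompt_tokens, kv_cache):
    # Prefill: all prompt tokens are processed in one batched pass;
    # each token's KV entry is written to the cache.
    for tok in prompt_tokens:
        kv_cache.append(tok)

def decode_step(kv_cache, next_token):
    # Decode: attention reads every cached KV entry, then the new
    # token's KV is appended for subsequent steps.
    reads = len(kv_cache)        # "Reused" count for this tick
    kv_cache.append(next_token)  # "New KV" written this tick
    return reads

kv = []
prefill([101, 42, 7, 9, 13, 55, 2], kv)   # tick 0: prefill 7 tokens
reads = decode_step(kv, 999)              # tick 1: first decode step
print(reads, len(kv))                     # 7 reads, cache now holds 8
```

Note the asymmetry this makes visible: prefill writes many KV entries per tick, while decode writes exactly one and reads everything already cached.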
Policy Behavior
- Sliding window: evict oldest tokens when cache is full (long-context models).
- Pinned prefix: protect the system prompt; reduces eviction risk for core instructions.
- Recent-N: keep only recent history for low-latency streaming.
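The three policies above can each be expressed as a small function over a flat list of cached token positions. This is a hedged sketch: the capacity of 32 matches the visualizer, but the pin length and N are illustrative defaults.

```python
CAPACITY = 32

def sliding_window(cache):
    # Evict the oldest entries when the cache exceeds capacity.
    if len(cache) > CAPACITY:
        del cache[0:len(cache) - CAPACITY]
    return cache

def pinned_prefix(cache, pin=8):
    # Protect the first `pin` entries (the system prompt);
    # evict the oldest unpinned entries instead.
    overflow = len(cache) - CAPACITY
    if overflow > 0:
        del cache[pin:pin + overflow]
    return cache

def recent_n(cache, n=16):
    # Keep only the most recent n entries, regardless of capacity.
    return cache[-n:]

# 40 cached positions against a capacity of 32:
print(len(sliding_window(list(range(40)))))  # 32, oldest 8 gone
print(pinned_prefix(list(range(40)))[:9])    # positions 0-7 survive, then 16
```

The key difference shows up in what survives: sliding window always loses the prompt's head, pinned prefix never does, and recent-N discards everything outside its window.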
Operational Impact
- Latency: more reuse → faster decode; more eviction → more recompute.
- Throughput: batching improves compute utilization, not memory sharing.
- Recall: larger retained window improves grounding but raises memory pressure.
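The latency point above can be made concrete with a back-of-envelope cost model: reading a cached KV entry is cheap, while a token whose KV was evicted must be recomputed with a full forward pass. The cost units below are illustrative, not measurements.

```python
def decode_cost(cached_tokens, evicted_tokens, read_cost=1, recompute_cost=10):
    # Cached entries are read; evicted entries must be recomputed
    # before they can contribute to attention again.
    return cached_tokens * read_cost + evicted_tokens * recompute_cost

print(decode_cost(32, 0))   # all reuse: 32
print(decode_cost(24, 8))   # 8 evicted: 104
```

Even with these toy numbers, a modest eviction rate dominates the per-step cost, which is why aggressive policies like recent-N trade recall for memory rather than latency alone.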
When to Use What
Sliding window for long contexts, pinned prefix for stable system prompts, recent-N for low-latency streaming. Each policy trades memory usage for context recall.