KV Cache Visualizer

Real-time inference simulation: prefill → decode → streaming → eviction

Controls: mode, eviction policy, prompt, speed.
Status readout: inference tick (starts at 0), phase (Prefill, parallel), prefill progress (0/7 tokens).
Default eviction policy: evicts the oldest tokens when the cache is full (like Mistral, Llama 3.1).
GPU: KV Cache Blocks — paged KV storage. Prefill: process all tokens sequentially.
Four blocks (Block 1–4), each with capacity 8 slots; at tick 0 every block reads 0/8 and all 32 slots are empty.
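The block layout above (4 blocks × 8 slots, filled in order) can be sketched as a minimal paged allocator. This is an illustrative sketch, not a real serving implementation; the class and method names (`PagedKVCache`, `append`) are hypothetical.

```python
# Minimal sketch of paged KV storage: fixed-size blocks, slots filled in order.
# All names here are illustrative assumptions, not from a real library.

class PagedKVCache:
    def __init__(self, num_blocks=4, block_size=8):
        self.block_size = block_size
        # Each block is a list of slots; None marks an empty slot.
        self.blocks = [[None] * block_size for _ in range(num_blocks)]
        self.used = 0

    @property
    def capacity(self):
        return len(self.blocks) * self.block_size

    def append(self, token_kv):
        """Write one token's KV into the next free slot, or fail when full."""
        if self.used >= self.capacity:
            return False  # cache full: caller must evict first
        block, slot = divmod(self.used, self.block_size)
        self.blocks[block][slot] = token_kv
        self.used += 1
        return True

cache = PagedKVCache()
for tok in ["The", "cat", "sat"]:
    cache.append(("kv", tok))
print(f"{cache.used}/{cache.capacity} slots used")  # 3/32 slots used
```

`divmod` maps a flat token index onto a (block, slot) pair, which is why appends fill Block 1 completely before touching Block 2, matching the visualizer's fill order.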
Legend
  • New KV (written this step)
  • Reused KV (attention reads)
  • Pinned KV (locked in prefix)
  • Evicted KV (fading out)
  • Empty slot
  • gen = generated decode token (simulated, not real inference)
Prefill: Process all prompt tokens sequentially.
Decode: Read cached KV → generate next token → write new KV.
Policy Impact:
  • Sliding Window: evicts oldest tokens on overflow.
  • Pinned Prefix: locks the first block (system prompt).
  • Recent-N: keeps only the most recent tokens.
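The three policies differ only in which cached tokens they drop when the cache overflows. A hedged sketch of that decision, with the function name, capacity, and pinned-prefix length chosen only for illustration:

```python
# Illustrative eviction decisions over a full cache of token indices.
# `evict`, `capacity`, and `pinned` are assumptions for this sketch.

def evict(policy, tokens, capacity, pinned=2):
    """Return the tokens retained after trimming to `capacity`."""
    if len(tokens) <= capacity:
        return tokens  # no overflow, nothing to evict
    if policy == "sliding_window":
        return tokens[-capacity:]            # drop oldest first
    if policy == "pinned_prefix":
        head = tokens[:pinned]               # system prompt never evicted
        return head + tokens[-(capacity - pinned):]
    if policy == "recent_n":
        # same mechanic as sliding window in this sketch; in practice the
        # difference is intent: a small N tuned for low-latency streaming
        return tokens[-capacity:]
    raise ValueError(policy)

toks = list(range(10))
print(evict("sliding_window", toks, 6))  # [4, 5, 6, 7, 8, 9]
print(evict("pinned_prefix", toks, 6))   # [0, 1, 6, 7, 8, 9]
```

Note how pinned prefix sacrifices two slots of recent history to keep tokens 0 and 1 (the system prompt) alive forever.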
KV Cache Blocks: 0/32 slots used (New: 0, Reused: 0, Empty: 32).
How Policies Map to Real LLM Serving
Inference Loop
  • Inference clock: one tick = one forward pass through the model.
  • Prefill phase: batch prompt tokens, allocate KV, fill initial blocks.
  • Decode phase: read cached KV, generate next token, append KV to last block.
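The tick loop above can be sketched directly: prefill writes KV for every prompt token, then each decode tick reads the whole cache and appends one entry. As in the visualizer, the next token is simulated, not real inference; `fake_next_token` and `run` are hypothetical names.

```python
# One tick = one simulated forward pass. Prefill allocates KV for all prompt
# tokens; decode reads the cache, generates, and appends one new KV per tick.

def fake_next_token(kv_cache):
    # Placeholder for a real model call; matches the "gen" tokens in the legend.
    return f"gen{len(kv_cache)}"

def run(prompt, decode_steps):
    kv_cache = []
    # Prefill phase: batch prompt tokens, fill initial cache entries.
    for tok in prompt:
        kv_cache.append(("kv", tok))
    out = []
    # Decode phase: read cached KV -> generate next token -> write new KV.
    for _ in range(decode_steps):
        reads = len(kv_cache)  # attention reads the entire cached history
        tok = fake_next_token(kv_cache)
        kv_cache.append(("kv", tok))
        out.append((tok, reads))
    return kv_cache, out

cache, generated = run(["The", "cat", "sat"], 2)
print(generated)  # [('gen3', 3), ('gen4', 4)]
```

The growing `reads` count per tick is why decode cost rises with context length: each new token attends over everything retained in the cache.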
Policy Behavior
  • Sliding window: evict oldest tokens when cache is full (long-context models).
  • Pinned prefix: protect the system prompt; reduces eviction risk for core instructions.
  • Recent-N: keep only recent history for low-latency streaming.
Operational Impact
  • Latency: more reuse → faster decode; more eviction → more recompute.
  • Throughput: batching improves compute utilization, not memory sharing.
  • Recall: larger retained window improves grounding but raises memory pressure.
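The memory-pressure point can be made concrete with the standard per-token KV size formula: 2 (one each for K and V) × layers × KV heads × head dimension × bytes per element. The model shape below is hypothetical, chosen only to illustrate the scaling:

```python
# Per-token KV footprint: K and V each hold layers * kv_heads * head_dim
# elements. The example shape (32 layers, 8 KV heads, head_dim 128, fp16)
# is an assumption for illustration, not any specific model's config.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_tok)                 # 131072 bytes = 128 KiB per token
print(per_tok * 8192 / 2**30)  # 1.0 -> one GiB for an 8k retained window
```

Doubling the retained window doubles this footprint linearly, which is the trade each eviction policy is managing.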
When to Use What

Sliding window for long contexts, pinned prefix for stable system prompts, recent-N for streaming. Each policy trades memory usage against context recall.