KV Cache Visualizer
Real-time inference simulation: prefill → decode → streaming → eviction
Mode
Inference Tick: 0
Phase: Prefill (parallel)
Eviction Policy
Evicts the oldest tokens when the cache is full (a sliding-window scheme, as in Mistral 7B).
Prompt
Speed: 1×
Prefill: 0/7 tokens
GPU: KV Cache Blocks
Paged KV Storage
Prefill: process all prompt tokens in one batched pass
Block 1: Capacity 0/8
Block 2: Capacity 0/8
Block 3: Capacity 0/8
Block 4: Capacity 0/8
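The paged layout above (4 blocks × 8 KV slots) can be sketched as a minimal block allocator. This is an illustrative simulation only, not a real inference engine: `PagedKVCache`, `BLOCK_SIZE`, and `NUM_BLOCKS` are assumed names chosen to mirror the visualizer.

```python
BLOCK_SIZE = 8   # KV slots per block, as shown above
NUM_BLOCKS = 4   # four blocks -> 32 slots total

class PagedKVCache:
    def __init__(self):
        # Each block holds the token ids whose K/V vectors it stores.
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def append(self, token_id):
        """Write one token's KV into the first block with free capacity."""
        for block in self.blocks:
            if len(block) < BLOCK_SIZE:
                block.append(token_id)
                return True
        return False  # cache full: an eviction policy must run first

    def used_slots(self):
        return sum(len(b) for b in self.blocks)

cache = PagedKVCache()
for t in range(7):          # prefill: write KV for 7 prompt tokens
    cache.append(t)
print(cache.used_slots())   # 7 of 32 slots used
```

Tokens fill the current block before a new one is claimed, which is why the visualizer's first block fills up before Block 2 is touched.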
Legend
New KV (written this step)
Reused (attention reads)
Pinned KV (locked in prefix)
Evicted KV (fading out)
Empty slot
gen = generated decode token (simulated, not real inference)
Prefill: Process all prompt tokens in one batched pass.
Decode: Read cached KV → generate next token → write new KV.
Policy Impact:
• Sliding Window: Evicts oldest on overflow
• Pinned Prefix: Locks first block (system prompt)
• Recent-N: Keeps only recent tokens
KV Cache Blocks
0/32 slots used
New: 0
Reused (attention reads): 0
Empty: 32
How Policies Map to Real LLM Serving
Inference Loop
- Inference clock: one tick = one forward pass through the model.
- Prefill phase: batch prompt tokens, allocate KV, fill initial blocks.
- Decode phase: read cached KV, generate next token, append KV to last block.
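The tick loop above can be sketched in a few lines. Names here are illustrative (there is no real model call): a tick either prefills all prompt KV at once or performs one decode step that reads every cached entry and appends one new one.

```python
# One tick = one forward pass (simulated).

def prefill(prompt_tokens, kv_cache):
    # Prefill: all prompt tokens are processed in one batched pass;
    # each token's KV entry is written to the cache.
    for tok in prompt_tokens:
        kv_cache.append(tok)

def decode_step(kv_cache, next_token):
    # Decode: attention reads every cached KV entry, then the new
    # token's KV is appended for subsequent steps.
    reads = len(kv_cache)        # "Reused" count for this tick
    kv_cache.append(next_token)  # "New KV" written this tick
    return reads

kv = []
prefill([101, 42, 7, 9, 13, 55, 2], kv)   # tick 0: prefill 7 tokens
reads = decode_step(kv, 999)              # tick 1: first decode step
print(reads, len(kv))                     # 7 reads, cache now holds 8
```

Note the asymmetry this makes visible: prefill writes many KV entries per tick, while decode writes exactly one and reads everything already cached.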
Policy Behavior
- Sliding window: evict oldest tokens when cache is full (long-context models).
- Pinned prefix: protect the system prompt; reduces eviction risk for core instructions.
- Recent-N: keep only recent history for low-latency streaming.
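The three policies above can each be expressed as a small function over a flat list of cached token positions. This is a hedged sketch: the capacity of 32 matches the visualizer, but the pin length and N are illustrative defaults.

```python
CAPACITY = 32

def sliding_window(cache):
    # Evict the oldest entries when the cache exceeds capacity.
    if len(cache) > CAPACITY:
        del cache[0:len(cache) - CAPACITY]
    return cache

def pinned_prefix(cache, pin=8):
    # Protect the first `pin` entries (the system prompt);
    # evict the oldest unpinned entries instead.
    overflow = len(cache) - CAPACITY
    if overflow > 0:
        del cache[pin:pin + overflow]
    return cache

def recent_n(cache, n=16):
    # Keep only the most recent n entries, regardless of capacity.
    return cache[-n:]

# 40 cached positions against a capacity of 32:
print(len(sliding_window(list(range(40)))))  # 32, oldest 8 gone
print(pinned_prefix(list(range(40)))[:9])    # positions 0-7 survive, then 16
```

The key difference shows up in what survives: sliding window always loses the prompt's head, pinned prefix never does, and recent-N discards everything outside its window.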
Operational Impact
- Latency: more reuse → faster decode; more eviction → more recompute.
- Throughput: batching improves compute utilization, not memory sharing.
- Recall: larger retained window improves grounding but raises memory pressure.
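The latency point above can be made concrete with a back-of-envelope cost model: reading a cached KV entry is cheap, while a token whose KV was evicted must be recomputed with a full forward pass. The cost units below are illustrative, not measurements.

```python
def decode_cost(cached_tokens, evicted_tokens, read_cost=1, recompute_cost=10):
    # Cached entries are read; evicted entries must be recomputed
    # before they can contribute to attention again.
    return cached_tokens * read_cost + evicted_tokens * recompute_cost

print(decode_cost(32, 0))   # all reuse: 32
print(decode_cost(24, 8))   # 8 evicted: 104
```

Even with these toy numbers, a modest eviction rate dominates the per-step cost, which is why aggressive policies like recent-N trade recall for memory rather than latency alone.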
When to Use What
Sliding window for long contexts, pinned prefix for stable system prompts, recent-N for low-latency streaming. Each policy trades memory usage for context recall.