How vLLM Actually Works: I Built It From Scratch So You Don't Have To
A deep dive into LLM inference -- from a single character to serving millions of requests. With diagrams, code, real benchmarks, and the intuition behind every design decision.
When I first started learning about LLM serving, every explanation felt either too shallow ("vLLM uses PagedAttention!") or too dense (reading the actual paper with its proofs and math notation). The blog posts gave me vocabulary -- "KV cache," "continuous batching," "PagedAttention" -- but not understanding.
So I did something different. I rebuilt the entire vLLM inference pipeline from scratch. Not by reading the source code, but by implementing every component myself: tokenizer exploration, multi-head attention, KV caching, a block manager, a scheduler. Fourteen files of Python across five phases, each producing something runnable and measurable.
Then I ran real vLLM on a GPU and compared. Every prediction matched.
This is what I learned. By the end, you'll understand exactly what happens from the moment you type a prompt to the moment you see the response streaming back -- and why vLLM can serve 6x more users than naive approaches on the same GPU.
Table of Contents
- Tokenization: Text Becomes Numbers
- Embeddings: Numbers Become Meaning
- Attention: How Tokens Relate to Each Other
- The KV Cache: Speed vs Memory
- PagedAttention: vLLM's Core Innovation
- The Scheduler: Serving 1000 Users on 1 GPU
- Putting It All Together
- Key Takeaways
1. Tokenization: Text Becomes Numbers
Imagine a translator converting English into Morse code. They don't go letter by letter (too slow), and they don't try to encode whole sentences (too rigid). Instead, they break text into efficient chunks -- common words stay whole, rare words get split into pieces. That's tokenization.
GPUs do math on numbers, not words. The first step in LLM inference is converting your text into token IDs -- integers that index into a vocabulary the model learned during training.
GPT-2 uses Byte-Pair Encoding (BPE) with a vocabulary of 50,257 tokens. The algorithm looked at the entire training dataset and found the most efficient set of subword pieces to represent it all. Common words are single tokens; rare words get split:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.encode("Hello") # [15496] -- 1 token
tokenizer.encode("unbelievable") # [403, 6667, 11203, 540] -- 4 tokens
# "un" "bel" "iev" "able"
A token is roughly 4 characters on average. But this varies dramatically by content type:
| Content Type | Chars per Token | Implication |
|---|---|---|
| English prose | 5.1 | Most efficient |
| Technical text | 3.9 | Slightly more tokens |
| Python code | 2.7 | Almost 2x more tokens than English! |
This matters more than people realize. When someone says "this model supports 128K context," that's 128K tokens -- roughly 650K characters of English prose, but only about 345K characters of code. Your effective context window is nearly halved for code.
Why this matters for cost: Every token consumes GPU memory and compute. Code-heavy workloads use ~2x more tokens per character than English, doubling the effective cost of serving those requests. Understanding token density is the first step to understanding serving costs.
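The context-window arithmetic is a one-liner, using the chars-per-token figures from the table above:

```python
def effective_context_chars(context_tokens: int, chars_per_token: float) -> int:
    """Rough character budget for a given token context window."""
    return round(context_tokens * chars_per_token)

# A "128K context" stretches very differently depending on what fills it:
english = effective_context_chars(128_000, 5.1)  # 652,800 chars of English prose
code = effective_context_chars(128_000, 2.7)     # 345,600 chars of Python code
```

The ratio 2.7/5.1 is why "context window" is a slippery number: the same model holds almost twice as much English as code.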
2. Embeddings: Numbers Become Meaning
Think of a map where cities close in meaning are close in space. "King" and "queen" are neighbors. "Dog" and "cat" are in the same district. "King" and "banana" are on different continents. Except this map has 768 dimensions instead of 2.
Token ID 15496 for "Hello" is just an arbitrary number. The model needs something richer -- a vector that captures semantic meaning. That's the embedding table.
GPT-2's embedding table (wte) is a matrix of shape [50,257 x 768] -- one row of 768 numbers for each token in the vocabulary. These vectors were learned during training, and they encode rich semantic relationships:
# Cosine similarity between embedding vectors:
sim("king", "queen") = 0.81 # both royalty -- very similar!
sim("man", "woman") = 0.64 # both people
sim("dog", "cat") = 0.55 # both animals
sim("king", "banana") = 0.26 # unrelated
Nobody programmed these. The model learned them from billions of words.
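Cosine similarity itself is just a normalized dot product. A minimal version -- note the toy vectors here are NOT real GPT-2 embeddings; the real numbers come from rows of the `wte` matrix:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: identical directions -> 1.0, orthogonal -> 0.0
print(cosine([3.0, 4.0], [3.0, 4.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```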
Position Matters
There's a second embedding table: positional embeddings (wpe), shape [1,024 x 768]. Without it, "dog bites man" and "man bites dog" would look identical to the model -- same tokens, same vectors.
The positional embedding encodes where each token sits in the sequence. Position 0 gets one vector, position 1 gets a different one, and so on up to position 1,023 (GPT-2's max context length).
The final input to the transformer is:
input = token_embedding[token_id] + position_embedding[position]
For "Make me sound smart" (4 tokens), this produces a [4 x 768] matrix -- 3,072 floating-point numbers. That matrix is what goes on the GPU. I built this manually and verified it matches GPT-2's internal computation with zero error.
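The lookup-and-add is simple enough to sketch with toy sizes. The tables below are random stand-ins for the learned weights; real GPT-2 uses a 50,257 x 768 `wte` and a 1,024 x 768 `wpe`:

```python
import random

random.seed(0)
VOCAB, MAX_POS, D = 32, 16, 8   # toy sizes, NOT GPT-2's real dimensions

# Random stand-ins for the learned token (wte) and position (wpe) tables
wte = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(VOCAB)]
wpe = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(MAX_POS)]

def embed(token_ids):
    """input[i] = token_embedding[token_id] + position_embedding[i]"""
    return [[t + p for t, p in zip(wte[tok], wpe[pos])]
            for pos, tok in enumerate(token_ids)]

x = embed([3, 14, 15, 9])   # 4 tokens -> a [4 x 8] matrix ([4 x 768] in real GPT-2)
```

Two different tokens at the same position get different rows of `wte` but the same row of `wpe`, which is exactly how "dog bites man" and "man bites dog" end up looking different.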
The GPU Memory Equation
This is the equation that governs everything in LLM serving:
GPU Memory = Model Weights + KV Cache (your tokens)
Llama 8B in FP16 takes 14.9 GB. On a 16 GB T4, that leaves only 1.1 GB for tokens -- barely room for a handful of concurrent users. On an 80 GB A100, you get 65.1 GB. With FP8 quantization on an H100 (half the model size), you get 72.5 GB.
Why this matters for cost: GPU Memory = Model + KV Cache. Shrink the model (quantize from FP16 to FP8 or INT4) and you get more room for KV cache, which means more concurrent users, which means lower cost per request. This is why Nvidia's H100 was so popular -- FP8 support doubled the effective capacity. And FP4 on Blackwell will double it again.
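That arithmetic is worth making concrete. A quick sketch, treating GPU sizes as GiB and using Llama 3 8B's roughly 8.03B parameters:

```python
def kv_headroom_gb(gpu_gb: float, params_billion: float, bytes_per_param: float) -> float:
    """GPU Memory = Model Weights + KV Cache: what's left over for tokens."""
    model_gb = params_billion * 1e9 * bytes_per_param / 2**30
    return gpu_gb - model_gb

print(round(kv_headroom_gb(16, 8.03, 2), 1))  # T4, FP16   -> ~1.0 GB for KV cache
print(round(kv_headroom_gb(80, 8.03, 2), 1))  # A100, FP16 -> ~65.0 GB
print(round(kv_headroom_gb(80, 8.03, 1), 1))  # H100, FP8  -> ~72.5 GB
```

Same GPU, same model: halving bytes-per-parameter adds ~7.5 GB of KV-cache headroom, which translates directly into concurrent users.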
3. Attention: How Tokens Relate to Each Other
Imagine a classroom where students can pass notes to each other, but there's one rule: you can only pass notes to students sitting in front of you -- never behind. Each student writes a question (Query), advertises what they know (Key), and shares their notes (Value). The students sitting in front of you whose knowledge matches your question get the most attention.
We now have a [4 x 768] matrix on the GPU. But each token's vector is independent -- "smart" doesn't know that "sound" came before it. The model needs to figure out: when predicting the next token, which previous tokens are important?
Q, K, V: Three Roles
The input matrix gets multiplied by three different learned weight matrices:
- Q (Query): "What am I looking for?" -- each token's search query
- K (Key): "What do I contain?" -- each token's label/advertisement
- V (Value): "Here's my actual information" -- the content to retrieve
The attention formula:
attention = softmax(Q x K^T / sqrt(d_k)) x V
Let me trace through with real numbers. For "I love dogs" with simplified 3-dim vectors:
Step 1: Q x K^T -- every token scores against every other:
Score matrix: "I" "love" "dogs"
"I" asks: [ 6 3 3 ] ← "I" cares most about itself
"love" asks: [ 5 10 7 ] ← "love" cares most about "love"
"dogs" asks: [ 5 5 4 ] ← "dogs" cares about "I" and "love" equally
Step 2: / sqrt(d_k) -- divide by square root of dimension to prevent extreme scores.
Step 3: Causal mask -- this is critical. Each token can only see past tokens:
Mask: [1 0 0] "I" can ONLY see itself
[1 1 0] "love" sees "I" and itself
[1 1 1] "dogs" sees everything
Where mask = 0, score becomes -infinity → softmax turns it to 0%
This is why LLMs generate one token at a time. During generation, future tokens don't exist yet. The mask prevents the model from cheating.
Step 4: softmax x V -- convert scores to percentages, blend information:
"dogs" attention weights: 38% on "I", 38% on "love", 24% on itself
output_for_dogs = 0.38 × V["I"] + 0.38 × V["love"] + 0.24 × V["dogs"]
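All four steps fit in a few lines of plain Python. A toy sketch on unbatched, single-head inputs -- real implementations operate on batched tensors, but the math is the same:

```python
import math

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (lists of row vectors)."""
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Step 1+3: score only against positions 0..i -- the mask hides the future
        scores = [sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]                     # Step 2: / sqrt(d_k)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]             # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]                  # percentages, sum to 1.0
        out.append([sum(weights[j] * V[j][k] for j in range(i + 1))  # Step 4: blend V
                    for k in range(d)])
    return out

Q = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
K = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = causal_attention(Q, K, V)
# out[0] == V[0]: the first token can only attend to itself
```

Because `V` here is the identity matrix, the last output row is literally the attention weights for the last token -- a handy trick for inspecting what a head is doing.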
Multi-Head: 12 Perspectives at Once
One attention head can only learn one type of relationship. GPT-2 runs 12 heads in parallel, each with 64 dimensions (12 x 64 = 768), each learning different patterns:
Head 0 might learn subject-verb relationships. Head 1 might learn adjective-noun patterns. Head 5 might learn coreference ("he" refers to "John"). All 12 compute independently, then their outputs are concatenated and mixed.
The Full Transformer Block
One block is: LayerNorm → Attention (12 heads) → Residual → LayerNorm → MLP (768→3072→768) → Residual
The MLP expands to 4x width for "thinking" then compresses back. The residual connection (output = input + attention(input)) prevents information loss through deep networks. GPT-2 stacks 12 of these blocks. After all 12, a projection to vocabulary size (50,257) produces the next-token prediction.
Why this matters for cost: Llama 70B has 80 layers with 64 attention heads each. Every single token generation requires attention across all heads in all layers. This is enormous compute -- and it happens for every token, for every user, continuously. The attention mechanism IS the compute cost.
4. The KV Cache: Speed vs Memory
Imagine two students taking an exam. Student A takes meticulous notes -- every time they learn something, they write it down. When a new question comes, they glance at their notes and answer quickly. Student B refuses to take notes -- for every question, they re-read the entire textbook from page 1. Same answers, but Student A finishes 50x faster.
During generation, we produce tokens one at a time. At each step, the new token's Query needs to attend to ALL previous tokens' Keys and Values. But Keys and Values for previous tokens don't change -- they depend only on the token's embedding and the weight matrices, both of which are fixed.
Without caching, we recompute K and V for every previous token at every step:
WITHOUT cache (naive):
Step 1: compute Q,K,V for 2 tokens
Step 2: compute Q,K,V for 3 tokens (2 REDUNDANT!)
Step 100: compute Q,K,V for 101 tokens (100 REDUNDANT!)
Total work: 2+3+4+...+101 = 5,150 computations. O(n²)
WITH cache:
Step 1: compute Q,K,V for 2 tokens. SAVE K,V to cache.
Step 2: compute Q,K,V for 1 NEW token. Append to cache.
Step 100: compute Q,K,V for 1 NEW token. Append to cache.
Total work: 2 + 99×1 = 101 computations. O(n)
I implemented both and measured on GPT-2: 3.1x speedup, identical output. For larger models with longer sequences, the gap grows to 10-100x.
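The bookkeeping is easy to check. A sketch that counts K/V computations, treating step 1 as the 2-token prefill and every later step as one cached append:

```python
def ops_without_cache(prompt_tokens: int, steps: int) -> int:
    """Naive: every step recomputes K,V for the whole sequence so far."""
    return sum(prompt_tokens + s for s in range(steps))

def ops_with_cache(prompt_tokens: int, steps: int) -> int:
    """Cached: step 1 computes the prompt once; each later step adds one token."""
    return prompt_tokens + (steps - 1)

print(ops_without_cache(2, 100))  # 5150 -- grows as O(n^2)
print(ops_with_cache(2, 100))     # 101  -- grows as O(n)
```

The gap widens with sequence length: at 1,000 steps the naive approach does roughly 500x more K/V work.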
Two Phases, Two Metrics
LLM generation has two distinct phases:
PREFILL: Process the entire prompt at once. This is a matrix-matrix multiplication (compute-bound). The time this takes = TTFT (Time to First Token). It grows with prompt length.
DECODE: Generate tokens one at a time, using the KV cache. This is a matrix-vector multiplication (memory-bandwidth-bound). The time per token = ITL (Inter-Token Latency). It's roughly constant regardless of sequence position.
From my benchmarks on our from-scratch GPT-2:
| Prompt Length | TTFT | Avg ITL |
|---|---|---|
| 5 tokens | 27 ms | 8.1 ms |
| 20 tokens | 24 ms | 9.6 ms |
| 45 tokens | 40 ms | 9.9 ms |
TTFT grows with input sequence length (ISL); ITL stays roughly constant. Exactly as predicted.
Why this matters for cost: KV cache trades memory for speed. 100 concurrent users with 4096-token sequences on GPT-2 alone consumes 15 GB of KV cache. For Llama 70B with 128K context, the numbers are staggering. This memory pressure is THE problem that vLLM solves.
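Where does that 15 GB come from? Each token stores one K and one V vector of size d_model in every layer. A sketch with GPT-2's dimensions at FP16 (note GPT-2's real context limit is 1,024 tokens; the 4,096 here just mirrors the scenario above):

```python
def kv_cache_bytes(num_tokens: int, n_layers: int, d_model: int,
                   bytes_per_val: int = 2) -> int:
    """Per token: a K vector and a V vector of size d_model, in every layer."""
    return 2 * n_layers * d_model * bytes_per_val * num_tokens

# GPT-2 (12 layers, d_model=768): 36,864 bytes (~36 KB) of KV cache per token
total = kv_cache_bytes(num_tokens=100 * 4096, n_layers=12, d_model=768)
print(round(total / 1e9, 1))  # 15.1 -- GB for 100 users at 4,096 tokens each
```

Swap in Llama-70B-scale dimensions (80 layers, d_model=8192) and the per-token cost jumps from ~36 KB to megabytes -- which is why KV cache, not weights, dominates memory at long context.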
5. PagedAttention: vLLM's Core Innovation
Imagine a hotel. The old hotel reserves an entire floor (2048 rooms) for every guest at check-in, even though most guests only use 200 rooms. 90% of rooms sit empty. The new hotel (vLLM) gives you exactly the rooms you need, scattered across different floors, and hands you a key card that maps your room numbers to physical locations. Less than 1% waste.
This is the key insight from the 2023 paper by Kwon et al. that made vLLM famous. Traditional KV cache allocation is catastrophically wasteful.
The Problem: 81% Memory Waste
Traditional approach: when a request arrives, reserve max_seq_len (e.g., 2048 slots) of contiguous GPU memory for its KV cache. If the actual sequence is only 200 tokens, 90% is wasted.
Across realistic workloads with varying sequence lengths, I measured 81% total memory waste with contiguous allocation. And because allocations must be contiguous, you also get fragmentation -- even if enough total free memory exists, it might not be in one contiguous chunk.
The Solution: OS-Style Memory Paging
Instead of one big contiguous block, vLLM breaks GPU memory into small fixed-size blocks (typically 16 tokens each). Blocks are allocated on demand as the sequence grows. When the sequence finishes, blocks return to the free pool.
The data structures map directly to how your operating system manages RAM:
| vLLM Component | OS Equivalent | What It Does |
|---|---|---|
| PhysicalBlock | Memory Page | Holds KV data for 16 tokens |
| BlockAllocator | Free Page List | O(1) allocate and free |
| BlockTable | Page Table | Maps logical → physical blocks per request |
| BlockManager | Memory Manager | Orchestrates across all requests |
The blocks don't need to be adjacent. Request A might use physical blocks [3, 7, 1, 9] -- completely scattered. The block table maps the logical order to physical locations, just like a page table in your OS.
Key insight: blocks are all the same size, so any free block can go to any request. No fragmentation.
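A minimal sketch of that bookkeeping, with hypothetical `BlockAllocator` and `Sequence` classes -- real vLLM's block manager also handles reference counting, copy-on-write, and swapping:

```python
class BlockAllocator:
    """Free-page-list style allocator over fixed-size KV blocks."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list: O(1) allocate/release
    def allocate(self) -> int:
        return self.free.pop()
    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Per-request block table: logical order -> scattered physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.alloc = allocator
        self.block_table = []   # e.g. [3, 7, 1, 9] -- need not be contiguous
        self.num_tokens = 0
    def append_token(self) -> None:
        if self.num_tokens % self.alloc.block_size == 0:  # current block is full
            self.block_table.append(self.alloc.allocate())
        self.num_tokens += 1
    def free(self) -> None:
        for b in self.block_table:
            self.alloc.release(b)   # blocks return to the pool for anyone to reuse
        self.block_table.clear()

alloc = BlockAllocator(num_blocks=128)
seq = Sequence(alloc)
for _ in range(200):
    seq.append_token()
print(len(seq.block_table))  # 13 blocks = 208 slots, vs 2,048 reserved contiguously
```

A 200-token sequence wastes 8 slots (the tail of its last block) instead of 1,848 -- the only waste paging allows is inside the final partially-filled block.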
From my from-scratch benchmark:
| | Contiguous | Paged (block_size=16) |
|---|---|---|
| Total slots allocated | 20,480 | 3,792 |
| Actually used | 3,765 | 3,765 |
| Wasted | 16,715 (81%) | 27 (0.7%) |
| Max concurrent requests | 4 | 26 |
Why this matters for cost: 81% waste → <1% waste. 4 concurrent users → 26 concurrent users on the same GPU. That's 6.5x more capacity = 6.5x cost reduction. This is why vLLM went viral. This single optimization -- applying a 50-year-old OS concept to KV cache management -- transformed LLM serving economics.
6. The Scheduler: Serving 1000 Users on 1 GPU
Imagine a restaurant where the rule is: nobody can leave until the slowest eater finishes their meal. You ordered a salad (5 minutes), but the person next to you ordered a 7-course tasting menu (3 hours). You sit there for 3 hours waiting. That's static batching. vLLM's restaurant lets you pay and leave whenever you're done.
You have one GPU and requests arriving constantly. Some users want 10 tokens (a short answer), some want 1000 tokens (a full essay). How do you handle them?
Static Batching: The Head-of-Line Problem
Group requests into a fixed batch. Process the entire batch until every request finishes. Then start the next batch.
Batch: [User A (500 tokens), User B (10 tokens), User C (800 tokens)]
Step 10: User B is DONE (only needed 10 tokens)
But User B CAN'T LEAVE. Must wait for the batch.
Step 500: User A is DONE. Still waiting for C.
Step 800: User C finishes. NOW everyone is released.
User B waited 790 steps for NOTHING. That's 80x slower than necessary.
Continuous Batching: No Wasted GPU Cycles
Every step, check: did anyone finish? If yes, remove them immediately and admit a new request from the waiting queue.
User B gets their response at step 10, not step 800. The moment they leave, a new user joins. The GPU never has an idle slot.
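Both policies are easy to simulate. A toy sketch -- fixed slot count, one token generated per request per step; vLLM's real engine loop tracks far more state:

```python
def static_batching(requests):
    """Nobody leaves until the slowest request in the batch finishes."""
    batch_end = max(tokens for _, tokens in requests)
    return {name: batch_end for name, _ in requests}

def continuous_batching(requests, max_batch=3):
    """Finished requests leave each step; waiting requests take their slot."""
    waiting, running, finish, step = list(requests), {}, {}, 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # fill any free slots
            name, tokens = waiting.pop(0)
            running[name] = tokens
        step += 1                                     # one decode step for everyone
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:                    # done -> leaves immediately
                finish[name] = step
                del running[name]
    return finish

reqs = [("A", 500), ("B", 10), ("C", 800)]
print(static_batching(reqs))      # everyone released at step 800
print(continuous_batching(reqs))  # B out at step 10, A at 500, C at 800
```

Same total work in both cases; the only difference is when each user gets their answer -- which is exactly what latency SLOs measure.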
The 3-Queue Scheduler
vLLM's scheduler manages three queues:
WAITING → requests that haven't started yet
RUNNING → actively generating tokens (KV cache allocated)
PREEMPTED → kicked out when GPU memory is full, will retry later
Each scheduling step:
- Try to allocate decode slots for all running requests
- If GPU memory is full → preempt the lowest-priority running request (free its KV blocks)
- If memory is available → admit new requests from preempted queue (priority), then waiting queue
When a preempted request is re-admitted, there are two strategies:
- Recompute: Discard its KV cache, redo prefill from scratch. Simple, no extra memory needed.
- Swap: Copy its KV blocks to CPU RAM, restore to GPU when re-admitted. Faster re-admission but uses CPU memory.
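Here is a toy sketch of one scheduling step with recompute-style preemption. The function and its one-block-per-step memory accounting are simplifications for illustration, not vLLM's actual scheduler code:

```python
def schedule_step(waiting, running, preempted, free_blocks, need=1):
    """One toy scheduling step: preempt if memory is short, decode, then admit.
    Requests are dicts like {"name": "A", "blocks": 4} (blocks = KV blocks held)."""
    # 1) If memory can't cover a decode block for every running request,
    #    preempt the lowest-priority one (here: most recently admitted).
    while running and free_blocks < len(running) * need:
        victim = running.pop()
        free_blocks += victim["blocks"]   # its KV blocks return to the pool
        victim["blocks"] = 0              # recompute strategy: cache discarded
        preempted.append(victim)
    # 2) Every survivor gets its decode block for this step.
    for req in running:
        req["blocks"] += need
        free_blocks -= need
    # 3) Admit from PREEMPTED first (priority), then WAITING, while memory allows.
    for queue in (preempted, waiting):
        while queue and free_blocks >= need:
            req = queue.pop(0)
            req["blocks"] = need          # toy prefill allocation: one block
            free_blocks -= need
            running.append(req)
    return free_blocks

running = [{"name": "A", "blocks": 4}, {"name": "B", "blocks": 4}]
preempted, waiting = [], []
free = schedule_step(waiting, running, preempted, free_blocks=1)
# B was preempted (cache freed), then immediately re-admitted with a fresh block
```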
The 4 Query Patterns
Mark Moyou from Nvidia called this "the most important slide" in his talk on LLM inference. Real production traffic has four distinct patterns:
| Pattern | ISL (Prompt) | OSL (Output) | Bottleneck |
|---|---|---|---|
| **Summarization** | 2000-4000 | 100-300 | TTFT (prefill-dominated) |
| **Chatbot** | 50-200 | 100-500 | Balanced |
| **Code Generation** | 200-500 | 500-2000 | KV cache memory (decode-dominated) |
| **Mixed Production** | All of the above | All of the above | Everything at once |
Static batching gets destroyed by mixed workloads: a code-generation request (800 tokens) blocks ten chatbot requests (10 tokens each) for hundreds of wasted steps. Continuous batching handles it gracefully -- short requests leave as soon as they're done.
Why this matters for cost: In mixed production traffic, continuous batching delivers 2-4x throughput improvement over static batching. Combined with PagedAttention's 6.5x memory efficiency, vLLM can serve an order of magnitude more users on the same hardware compared to naive approaches.
7. Putting It All Together
Every component connects into one coherent system:
Here's what happens when you send a prompt to vLLM:
- Tokenizer breaks your text into token IDs
- Embedding looks up each token's 768-dim vector + position vector
- The scheduler admits your request if GPU memory allows (WAITING → RUNNING)
- PagedAttention's block manager allocates KV cache blocks for your prompt
- Prefill: All 12 attention layers process your full prompt at once (compute TTFT)
- Decode loop: Each step generates one token. Attention reads from cached K,V. New K,V appended to blocks. Block manager allocates new blocks as needed.
- When your request finishes, scheduler frees all KV blocks → other requests can use them
- Detokenizer converts token IDs back to text → streamed to you
What I Built vs What vLLM Does
Every component I built from scratch maps directly to real vLLM source code:
| My From-Scratch File | Real vLLM | What It Does |
|---|---|---|
| `tokenization_lab.py` | HuggingFace tokenizers | Text ↔ token IDs |
| `embedding_explorer.py` | Model embedding layers | Token IDs → vectors |
| `attention.py` | `vllm/attention/backends/` | Q/K/V computation, multi-head |
| `kv_cache.py` | KV cache infrastructure | Store and read K,V tensors |
| `block_manager.py` | `vllm/core/block_manager.py` | Allocate/free KV blocks |
| `paged_attention.py` | PagedAttention CUDA kernel | Attention over scattered blocks |
| `scheduler.py` | `vllm/core/scheduler.py` | 3-queue scheduling, preemption |
| `continuous_batching.py` | vLLM engine loop | In-flight batching |
| `workload_simulator.py` | Production monitoring | ISL/OSL traffic analysis |
Real Benchmarks (TinyLlama-1.1B on T4 GPU)
Running actual vLLM on a Google Colab T4 confirmed every prediction:
TTFT scales linearly with prompt length (prefill is compute-bound):
ISL=10: ~25ms | ISL=100: ~30ms | ISL=500: ~55ms | ISL=1000: ~95ms
Throughput scales with batch size (PagedAttention enables large batches):
Batch=1: ~180 tok/s | Batch=8: ~800 tok/s | Batch=32: ~2400 tok/s | Batch=64: ~2800 tok/s
15x throughput improvement from batch=1 to batch=64. This is only possible because PagedAttention doesn't waste memory on over-allocated KV caches.
ISL/OSL 2D histogram reveals traffic shape -- the visualization from the Nvidia talk. Three distinct clusters emerge: summarization (high ISL, low OSL), chatbot (low ISL, medium OSL), and code generation (medium ISL, high OSL). This is the #1 diagnostic tool for LLM serving teams.
8. Key Takeaways
- Tokens, not words. LLMs process subword pieces (~4 characters each). Code uses ~2x more tokens than English, effectively halving your context window.
- Embeddings give tokens meaning. The input to the transformer is token_embedding + position_embedding -- a matrix of 768-dimensional vectors on the GPU.
- Attention = weighted averaging with a twist. Q matches against K, V provides the information. The causal mask enforces one-token-at-a-time generation. Multi-head attention learns 12 different relationship types simultaneously.
- KV cache is the #1 optimization. Without it, generation is O(n^2). With it, O(n). But it trades compute for GPU memory. TTFT (prefill) grows with prompt length; ITL (decode) stays constant.
- PagedAttention reduces memory waste from 81% to <1%. By allocating KV cache in 16-token blocks instead of contiguous max_seq_len chunks, vLLM serves 6.5x more concurrent users on the same GPU.
- Continuous batching eliminates head-of-line blocking. Finished requests leave immediately, new ones join. No wasted GPU cycles. 2-4x throughput improvement.
- GPU Memory = Model + KV Cache. Quantize the model (FP16→FP8→INT4) to free memory for more KV cache = more users = lower cost.
Further Reading
- vLLM Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)
- vLLM GitHub: github.com/vllm-project/vllm
- Nvidia Talk: Mark Moyou, "Mastering LLM Inference Optimization: From Theory to Cost-Effective Deployment"
I built every component from scratch to understand it, then validated on real hardware. The Nvidia talk gave me the "why." Building it gave me the "how." Real benchmarks gave me the "proof."
If you want to truly understand LLM inference -- don't just read about it. Build it.