How vLLM Actually Works: I Built It From Scratch So You Don't Have To

A deep dive into LLM inference -- from a single character to serving millions of requests. With diagrams, code, real benchmarks, and the intuition behind every design decision.


When I first started learning about LLM serving, every explanation felt either too shallow ("vLLM uses PagedAttention!") or too dense (reading the actual paper with its proofs and math notation). The blog posts gave me vocabulary -- "KV cache," "continuous batching," "PagedAttention" -- but not understanding.

So I did something different. I rebuilt the entire vLLM inference pipeline from scratch. Not by reading the source code, but by implementing every component myself: tokenizer exploration, multi-head attention, KV caching, a block manager, a scheduler. Fourteen files of Python across five phases, each producing something runnable and measurable.

Then I ran real vLLM on a GPU and compared. Every prediction matched.

This is what I learned. By the end, you'll understand exactly what happens from the moment you type a prompt to the moment you see the response streaming back -- and why vLLM can serve 6x more users than naive approaches on the same GPU.

[Diagram: the LLM inference pipeline -- your prompt → tokenizer → embeddings (wte[id] + wpe[pos]) → 12 attention layers + KV cache → PagedAttention + scheduler → streaming output, one token at a time]

Table of Contents

  1. Tokenization: Text Becomes Numbers
  2. Embeddings: Numbers Become Meaning
  3. Attention: How Tokens Relate to Each Other
  4. The KV Cache: Speed vs Memory
  5. PagedAttention: vLLM's Core Innovation
  6. The Scheduler: Serving 1000 Users on 1 GPU
  7. Putting It All Together
  8. Key Takeaways

1. Tokenization: Text Becomes Numbers

Imagine a translator converting English into Morse code. They don't go letter by letter (too slow), and they don't try to encode whole sentences (too rigid). Instead, they break text into efficient chunks -- common words stay whole, rare words get split into pieces. That's tokenization.

[Diagram: BPE splits "unbelievable" into 4 subword tokens -- "un" (403), "bel" (6667), "iev" (11203), "able" (540) -- plus token density by content type: English prose 5.1 chars/token, technical text 3.9, Python code 2.7]

GPUs do math on numbers, not words. The first step in LLM inference is converting your text into token IDs -- integers that index into a vocabulary the model learned during training.

GPT-2 uses Byte-Pair Encoding (BPE) with a vocabulary of 50,257 tokens. The algorithm looked at the entire training dataset and found the most efficient set of subword pieces to represent it all. Common words are single tokens; rare words get split:


tokenizer.encode("Hello")           # [15496]          -- 1 token
tokenizer.encode("unbelievable")    # [403, 6667, 11203, 540]  -- 4 tokens
                                    #  "un"  "bel"  "iev"  "able"

A token is roughly 4 characters on average. But this varies dramatically by content type:

| Content Type | Chars per Token | Implication |
|---|---|---|
| English prose | 5.1 | Most efficient |
| Technical text | 3.9 | Slightly more tokens |
| Python code | 2.7 | Almost 2x more tokens than English! |

This matters more than people realize. When someone says "this model supports 128K context," that's 128K tokens -- roughly 650K characters of English prose (at 5.1 chars/token), but only about 345K characters of code. Your effective context window is almost half the size for code.

Why this matters for cost: Every token consumes GPU memory and compute. Code-heavy workloads use ~2x more tokens per character than English, doubling the effective cost of serving those requests. Understanding token density is the first step to understanding serving costs.
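Token density translates directly into a back-of-envelope cost model. A sketch using the densities measured above (`estimate_tokens` and `effective_context_chars` are my own illustrative helpers, not a tokenizer API):

```python
# Back-of-envelope token estimates from the densities measured above.
# These helper functions are illustrative, not part of any tokenizer API.
CHARS_PER_TOKEN = {"english": 5.1, "technical": 3.9, "code": 2.7}

def estimate_tokens(text: str, content_type: str = "english") -> int:
    """Approximate token count from character count alone."""
    return round(len(text) / CHARS_PER_TOKEN[content_type])

def effective_context_chars(context_tokens: int, content_type: str) -> int:
    """How much text fits in a given token budget."""
    return round(context_tokens * CHARS_PER_TOKEN[content_type])

effective_context_chars(128_000, "english")  # ~652,800 chars of prose
effective_context_chars(128_000, "code")     # ~345,600 chars of code
```

The same 128K-token window holds nearly twice as much prose as code -- which is exactly why per-token pricing hits code-heavy workloads harder.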

2. Embeddings: Numbers Become Meaning

Think of a map where cities close in meaning are close in space. "King" and "queen" are neighbors. "Dog" and "cat" are in the same district. "King" and "banana" are on different continents. Except this map has 768 dimensions instead of 2.

Token ID 15496 for "Hello" is just an arbitrary number. The model needs something richer -- a vector that captures semantic meaning. That's the embedding table.

[Diagram: model input = wte[token_id] + wpe[position] -- the token embedding table (50,257 × 768) encodes WHAT each token means, the position table (1,024 × 768) encodes WHERE it sits; the 4-token prompt "Make me sound smart" becomes a 4 × 768 matrix, which is what the model actually sees]

GPT-2's embedding table (wte) is a matrix of shape [50,257 x 768] -- one row of 768 numbers for each token in the vocabulary. These vectors were learned during training, and they encode rich semantic relationships:


# Cosine similarity between embedding vectors:
sim("king", "queen")  = 0.81   # both royalty -- very similar!
sim("man", "woman")   = 0.64   # both people
sim("dog", "cat")     = 0.55   # both animals
sim("king", "banana") = 0.26   # unrelated

Nobody programmed these. The model learned them from billions of words.
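The similarity scores above are plain cosine similarity between embedding rows -- the dot product of two vectors, normalized by their lengths. In pure Python:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 0.0], [1.0, 0.0])   # parallel vectors -> 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors -> 0.0
```

Run it over the 768-dim rows of wte and you get exactly the king/queen/banana numbers shown.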

Position Matters

There's a second embedding table: positional embeddings (wpe), shape [1,024 x 768]. Without it, "dog bites man" and "man bites dog" would look identical to the model -- same tokens, same vectors.

The positional embedding encodes where each token sits in the sequence. Position 0 gets one vector, position 1 gets a different one, and so on up to position 1,023 (GPT-2's max context length).

The final input to the transformer is:


input = token_embedding[token_id] + position_embedding[position]

For "Make me sound smart" (4 tokens), this produces a [4 x 768] matrix -- 3,072 floating-point numbers. That matrix is what goes on the GPU. I built this manually and verified it matches GPT-2's internal computation with zero error.

The GPU Memory Equation

This is the equation that governs everything in LLM serving:


GPU Memory = Model Weights + KV Cache (your tokens)

[Diagram: GPU memory budget -- Llama 8B in FP16 barely fits on a 16 GB T4 (1.1 GB left for KV cache), leaves 65.1 GB on an 80 GB A100, and 72.5 GB on an H100 with FP8]

Llama 8B in FP16 takes 14.9 GB. On a 16 GB T4, that leaves only 1.1 GB for tokens -- barely room for a handful of concurrent users. On an 80 GB A100, you get 65.1 GB. With FP8 quantization on an H100 (half the model size), you get 72.5 GB.

Why this matters for cost: GPU Memory = Model + KV Cache. Shrink the model (quantize from FP16 to FP8 or INT4) and you get more room for KV cache, which means more concurrent users, which means lower cost per request. This is why Nvidia's H100 was so popular -- FP8 support doubled the effective capacity. And FP4 on Blackwell will double it again.
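The budget arithmetic above can be sketched in a few lines (`kv_cache_budget_gb` is an illustrative helper of mine; all figures are approximate):

```python
# Back-of-envelope GPU memory budget, mirroring the figures above.
# Illustrative helper, approximate numbers; GB measured as GiB (2**30).
def kv_cache_budget_gb(gpu_gb: float, params_b: float,
                       bytes_per_param: float) -> float:
    """Memory left for KV cache after loading the model weights."""
    weights_gb = params_b * bytes_per_param * 1e9 / 2**30
    return gpu_gb - weights_gb

kv_cache_budget_gb(16, 8, 2)   # Llama 8B FP16 on a T4:  ~1.1 GB left
kv_cache_budget_gb(80, 8, 2)   # same model on an A100: ~65.1 GB left
kv_cache_budget_gb(80, 8, 1)   # FP8 on an H100:        ~72.5 GB left
```

Halving bytes-per-param frees ~7.5 GB for KV cache here -- that is the quantization-to-capacity lever in one line.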

3. Attention: How Tokens Relate to Each Other

Imagine a classroom where students can pass notes to each other, but there's one rule: you can only pass notes to students sitting in front of you -- never behind. Each student writes a question (Query), advertises what they know (Key), and shares their notes (Value). The students sitting in front of you whose knowledge matches your question get the most attention.

We now have a [4 x 768] matrix on the GPU. But each token's vector is independent -- "smart" doesn't know that "sound" came before it. The model needs to figure out: when predicting the next token, which previous tokens are important?

Q, K, V: Three Roles

The input matrix gets multiplied by three different learned weight matrices:

[Diagram: Q = Query ("What am I looking for?"), K = Key ("What do I contain?"), V = Value ("Here's my information"), combined as attention = softmax(Q × K^T / √d_k) × V]

The attention formula:


attention = softmax(Q x K^T / sqrt(d_k)) x V

Let me trace through with real numbers. For "I love dogs" with simplified 3-dim vectors:

Step 1: Q x K^T -- every token scores against every other:


Score matrix:        "I"    "love"   "dogs"
"I" asks:          [  6      3        3   ]   ← "I" cares most about itself
"love" asks:       [  5      10       7   ]   ← "love" cares most about "love"
"dogs" asks:       [  5      5        4   ]   ← "dogs" cares about "I" and "love" equally

Step 2: / sqrt(d_k) -- divide by square root of dimension to prevent extreme scores.

Step 3: Causal mask -- this is critical. Each token can only see past tokens:


Mask:  [1  0  0]    "I" can ONLY see itself
       [1  1  0]    "love" sees "I" and itself
       [1  1  1]    "dogs" sees everything

Where mask = 0, score becomes -infinity → softmax turns it to 0%

This is why LLMs generate one token at a time. During generation, future tokens don't exist yet. The mask prevents the model from cheating.

Step 4: softmax x V -- convert scores to percentages, blend information:


"dogs" attention weights: 38% on "I", 38% on "love", 24% on itself
output_for_dogs = 0.38 × V["I"] + 0.38 × V["love"] + 0.24 × V["dogs"]

[Diagram: attention-weight heatmap for "I love dogs very" after softmax and causal mask -- each row is a probability distribution over visible tokens, future positions are blocked; row "I" is 100% self-attention, row "very" sees everything]
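The four steps above fit in one short function. A pure-Python sketch of single-head causal attention -- no batching, no GPU, just the formula:

```python
import math

def softmax(xs):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a causal mask, one head."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):                        # each query position
        scores = []
        for j, k in enumerate(K):
            if j > i:                                # causal mask: no future
                scores.append(float("-inf"))
            else:                                    # scaled dot product
                scores.append(sum(a * b for a, b in zip(q, k))
                              / math.sqrt(d_k))
        w = softmax(scores)                          # attention weights
        out.append([sum(w[j] * V[j][d] for j in range(len(V)))
                    for d in range(len(V[0]))])      # weighted blend of V
    return out
```

Position 0 can only attend to itself, so its output is exactly its own Value row -- the "100% self-attention" cell in the heatmap.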

Multi-Head: 12 Perspectives at Once

One attention head can only learn one type of relationship. GPT-2 runs 12 heads in parallel, each with 64 dimensions (12 x 64 = 768), each learning different patterns:

[Diagram: the 768-dim input splits into 12 heads of 64 dims each; every head runs the full Q×K^T → softmax → ×V computation independently, then the outputs are concatenated back to 768 dims and projected through W_o]

Head 0 might learn subject-verb relationships. Head 1 might learn adjective-noun patterns. Head 5 might learn coreference ("he" refers to "John"). All 12 compute independently, then their outputs are concatenated and mixed.

The Full Transformer Block

One block is: LayerNorm → Attention (12 heads) → Residual → LayerNorm → MLP (768→3072→768) → Residual

The MLP expands to 4x width for "thinking" then compresses back. The residual connection (output = input + attention(input)) prevents information loss through deep networks. GPT-2 stacks 12 of these blocks. After all 12, a projection to vocabulary size (50,257) produces the next-token prediction.
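The block's data flow -- normalize, transform, add back, twice -- can be sketched with toy stand-ins (`toy_attention` and `toy_mlp` here are placeholders for the real sublayers; only the residual wiring is the point):

```python
# Minimal sketch of one pre-norm transformer block in pure Python.
# toy_attention / toy_mlp are stand-ins for the real sublayers.
def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean, unit variance."""
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x) / len(x)
    return [(v - m) / (var + eps) ** 0.5 for v in x]

def toy_attention(x):      # stand-in: real attention mixes across tokens
    return [v * 0.5 for v in x]

def toy_mlp(x):            # stand-in for the 768 -> 3072 -> 768 expansion
    return [max(0.0, v) for v in x]

def block(x):
    x = [a + b for a, b in zip(x, toy_attention(layer_norm(x)))]  # residual 1
    x = [a + b for a, b in zip(x, toy_mlp(layer_norm(x)))]        # residual 2
    return x

y = block([1.0, -2.0, 3.0, 0.5])   # same shape in, same shape out
```

The `x + f(norm(x))` pattern is what lets gradients and information flow cleanly through 12 (or 80) stacked blocks.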

Why this matters for cost: Llama 70B has 80 layers with 64 attention heads each. Every single token generation requires attention across all heads in all layers. This is enormous compute -- and it happens for every token, for every user, continuously. The attention mechanism IS the compute cost.

4. The KV Cache: Speed vs Memory

Imagine two students taking an exam. Student A takes meticulous notes -- every time they learn something, they write it down. When a new question comes, they glance at their notes and answer quickly. Student B refuses to take notes -- for every question, they re-read the entire textbook from page 1. Same answers, but Student A finishes 50x faster.

During generation, we produce tokens one at a time. At each step, the new token's Query needs to attend to ALL previous tokens' Keys and Values. But Keys and Values for previous tokens don't change -- they depend only on the token's embedding and the weight matrices, both of which are fixed.

Without caching, we recompute K and V for every previous token at every step:

[Diagram: without the KV cache, every step recomputes K,V for all previous tokens -- O(n²) total work, ~5,000 computations by step 50; with the cache, only the new token is processed each step -- O(n), 102 computations. Same output, 50x less work, but the cache consumes GPU memory]

WITHOUT cache (naive):
  Step 1:   compute Q,K,V for 2 tokens
  Step 2:   compute Q,K,V for 3 tokens (2 REDUNDANT!)
  Step 100: compute Q,K,V for 101 tokens (100 REDUNDANT!)
  Total work: 2+3+4+...+101 = ~5,000 computations. O(n²)

WITH cache:
  Step 1:   compute Q,K,V for 2 tokens. SAVE K,V to cache.
  Step 2:   compute Q,K,V for 1 NEW token. Append to cache.
  Step 100: compute Q,K,V for 1 NEW token. Append to cache.
  Total work: 2 + 1×100 = 102 computations. O(n)

I implemented both and measured on GPT-2: 3.1x speedup, identical output. For larger models with longer sequences, the gap grows to 10-100x.
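The work counts from the trace above take two small functions to reproduce (the counting helpers are mine, purely illustrative):

```python
# Count K/V computations behind the O(n^2)-vs-O(n) claim above.
def naive_work(prompt_len: int, gen_steps: int) -> int:
    """Without a cache: recompute K,V for every token, every step."""
    return sum(prompt_len + s for s in range(gen_steps))

def cached_work(prompt_len: int, gen_steps: int) -> int:
    """With a cache: one prefill, then one new token per step."""
    return prompt_len + gen_steps

naive_work(2, 100)    # 2 + 3 + ... + 101 = 5,150 computations
cached_work(2, 100)   # 2 + 1 x 100      = 102 computations
```

At 100 generated tokens the gap is already ~50x, and it widens quadratically with sequence length.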

Two Phases, Two Metrics

LLM generation has two distinct phases:

[Diagram: prefill processes the entire prompt as one matrix-matrix multiply (compute-bound; its duration is TTFT, which grows with prompt length), then decode emits tokens one at a time via matrix-vector multiplies (its per-token duration is ITL, roughly constant)]

PREFILL: Process the entire prompt at once. This is a matrix-matrix multiplication (compute-bound). The time this takes = TTFT (Time to First Token). It grows with prompt length.

DECODE: Generate tokens one at a time, using the KV cache. This is a matrix-vector multiplication (memory-bandwidth-bound). The time per token = ITL (Inter-Token Latency). It's roughly constant regardless of sequence position.

From my benchmarks on our from-scratch GPT-2:


Prompt Length    TTFT        Avg ITL
5 tokens         27 ms       8.1 ms
20 tokens        24 ms       9.6 ms
45 tokens        40 ms       9.9 ms

TTFT grows with ISL. ITL stays constant. Exactly as predicted.

Why this matters for cost: KV cache trades memory for speed. 100 concurrent users with 4096-token sequences on GPT-2 alone consumes 15 GB of KV cache. For Llama 70B with 128K context, the numbers are staggering. This memory pressure is THE problem that vLLM solves.
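The 15 GB figure checks out with simple arithmetic (a sketch; `kv_cache_bytes` is an illustrative helper, values assumed FP16, and GB here means 10^9 bytes):

```python
# Sanity-check the KV-cache figure above. Illustrative helper;
# assumes FP16 (2 bytes per value); GB = 10^9 bytes.
def kv_cache_bytes(layers: int, hidden_dim: int, bytes_per_value: int,
                   tokens: int) -> int:
    """2x for K and V, per layer, per hidden dimension, per token."""
    return 2 * layers * hidden_dim * bytes_per_value * tokens

# GPT-2 small: 12 layers, 768 hidden dims.
per_user = kv_cache_bytes(12, 768, 2, tokens=4096)   # ~151 MB per user
total_gb = 100 * per_user / 1e9                      # 100 users -> ~15.1 GB
```

Scale the same formula to Llama 70B's 80 layers, 8,192 hidden dims, and 128K context and the per-user number alone runs into the hundreds of gigabytes -- which is exactly why paging the cache matters.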

5. PagedAttention: vLLM's Core Innovation

Imagine a hotel. The old hotel reserves an entire floor (2048 rooms) for every guest at check-in, even though most guests only use 200 rooms. 90% of rooms sit empty. The new hotel (vLLM) gives you exactly the rooms you need, scattered across different floors, and hands you a key card that maps your room numbers to physical locations. Less than 1% waste.

This is the key insight from the 2023 paper by Kwon et al. that made vLLM famous. Traditional KV cache allocation is catastrophically wasteful.

The Problem: 81% Memory Waste

Traditional approach: when a request arrives, reserve max_seq_len (e.g., 2048 slots) of contiguous GPU memory for its KV cache. If the actual sequence is only 200 tokens, 90% is wasted.

[Diagram: contiguous allocation reserves max_seq_len (2,048) slots per request and wastes 81% of memory (max 4 concurrent requests); paged allocation hands out 16-token blocks on demand, scattered across memory, with <1% waste (26 concurrent requests, 6.5x more)]

Across realistic workloads with varying sequence lengths, I measured 81% total memory waste with contiguous allocation. And because allocations must be contiguous, you also get fragmentation -- even if enough total free memory exists, it might not be in one contiguous chunk.

The Solution: OS-Style Memory Paging

Instead of one big contiguous block, vLLM breaks GPU memory into small fixed-size blocks (typically 16 tokens each). Blocks are allocated on demand as the sequence grows. When the sequence finishes, blocks return to the free pool.

The data structures map directly to how your operating system manages RAM:

| vLLM Component | OS Equivalent | What It Does |
|---|---|---|
| PhysicalBlock | Memory Page | Holds KV data for 16 tokens |
| BlockAllocator | Free Page List | O(1) allocate and free |
| BlockTable | Page Table | Maps logical → physical blocks per request |
| BlockManager | Memory Manager | Orchestrates across all requests |

[Diagram: Request A's logical blocks (tokens 0-15, 16-31, 32-47, 48-55) map through a block table to scattered physical GPU blocks -- blocks need not be adjacent, and any free block can serve any request, so there is no fragmentation]

The blocks don't need to be adjacent. Request A might use physical blocks [3, 7, 1, 9] -- completely scattered. The block table maps the logical order to physical locations, just like a page table in your OS.

Key insight: blocks are all the same size, so any free block can go to any request. No fragmentation.
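A minimal version of these data structures fits in a few dozen lines. This is a sketch in the spirit of vLLM's block manager, not its actual API (class and method names are mine):

```python
# Minimal paged KV-cache bookkeeping, in the spirit of vLLM's
# BlockAllocator / BlockTable. Names are illustrative, not vLLM's API.
BLOCK_SIZE = 16  # tokens per physical block

class BlockAllocator:
    """Free-list allocator: O(1) allocate and free, like an OS page list."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Request:
    """Tracks one sequence's logical -> physical block mapping."""
    def __init__(self):
        self.block_table = []      # logical index -> physical block ID
        self.num_tokens = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:   # last block is full
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
req = Request()
for _ in range(40):        # 40 tokens -> exactly 3 blocks allocated
    req.append_token(allocator)
```

Note what is absent: no max_seq_len reservation anywhere. A request only ever holds ceil(num_tokens / 16) blocks, which is where the <1% waste comes from.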

From my from-scratch benchmark:


                    Contiguous    Paged (block_size=16)
Total allocated:    20,480        3,792
Actually used:       3,765        3,765
WASTED:             16,715 (81%)  27 (0.7%)
Max concurrent:      4 requests   26 requests

Why this matters for cost: 81% waste → <1% waste. 4 concurrent users → 26 concurrent users on the same GPU. That's 6.5x more capacity = 6.5x cost reduction. This is why vLLM went viral. This single optimization -- applying a 50-year-old OS concept to KV cache management -- transformed LLM serving economics.

6. The Scheduler: Serving 1000 Users on 1 GPU

Imagine a restaurant where the rule is: nobody can leave until the slowest eater finishes their meal. You ordered a salad (5 minutes), but the person next to you ordered a 7-course tasting menu (3 hours). You sit there for 3 hours waiting. That's static batching. vLLM's restaurant lets you pay and leave whenever you're done.

You have one GPU and requests arriving constantly. Some users want 10 tokens (a short answer), some want 1000 tokens (a full essay). How do you handle them?

Static Batching: The Head-of-Line Problem

Group requests into a fixed batch. Process the entire batch until every request finishes. Then start the next batch.

[Diagram: static batching holds finished short requests hostage until the longest request (500 tokens) completes -- 50x slower for them; continuous batching releases finished requests immediately and back-fills the freed slots with new arrivals, so the GPU is never idle]

Batch: [User A (500 tokens), User B (10 tokens), User C (800 tokens)]

Step 10:  User B is DONE (only needed 10 tokens)
          But User B CAN'T LEAVE. Must wait for the batch.
Step 500: User A is DONE. Still waiting for C.
Step 800: User C finishes. NOW everyone is released.

User B waited 790 steps for NOTHING. That's 80x slower than necessary.

Continuous Batching: No Wasted GPU Cycles

Every step, check: did anyone finish? If yes, remove them immediately and admit a new request from the waiting queue.

User B gets their response at step 10, not step 800. The moment they leave, a new user joins. The GPU never has an idle slot.

The 3-Queue Scheduler

vLLM's scheduler manages three queues:

[Diagram: requests flow WAITING → RUNNING → FINISHED; when GPU memory fills, a running request is PREEMPTED (its blocks evicted) and re-admitted once memory frees]

WAITING → requests that haven't started yet

RUNNING → actively generating tokens (KV cache allocated)

PREEMPTED → kicked out when GPU memory is full, will retry later

Each scheduling step:

  1. Try to allocate decode slots for all running requests
  2. If GPU memory is full → preempt the lowest-priority running request (free its KV blocks)
  3. If memory is available → admit new requests from preempted queue (priority), then waiting queue

When a preempted request is re-admitted, there are two strategies: recomputation (discard its KV blocks and redo the prefill from scratch) or swapping (copy its KV blocks out to CPU memory and copy them back later). Recomputation is simpler and often wins for short sequences; swapping avoids redoing long prefills at the cost of PCIe transfers.
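The scheduling step above can be sketched as a toy loop (a heavy simplification of vLLM's scheduler; `can_allocate` and `free_blocks` stand in for the block manager):

```python
# Toy continuous-batching scheduler step mirroring the 3 queues above.
# can_allocate / free_blocks stand in for the block manager (assumptions).
from collections import deque

def schedule_step(waiting, running, preempted, can_allocate, free_blocks):
    """One step: preempt if memory is short, else admit more requests."""
    # 1. If memory can't support every running request, preempt from the back.
    while running and not can_allocate(len(running)):
        victim = running.pop()          # lowest priority: last admitted
        free_blocks(victim)             # return its KV blocks to the pool
        preempted.appendleft(victim)    # retried before new arrivals
    # 2. Admit preempted requests first, then fresh ones from waiting.
    for queue in (preempted, waiting):
        while queue and can_allocate(len(running) + 1):
            running.append(queue.popleft())
    return running

capacity = 3                            # toy stand-in for KV memory
fits = lambda n: n <= capacity
running = schedule_step(deque(["D"]), ["A", "B"], deque(), fits,
                        lambda r: None)   # "D" admitted -> ["A", "B", "D"]
```

Every decode iteration runs something like this before the model forward pass, which is how finished requests' memory flows straight to waiting ones.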

The 4 Query Patterns

Mark Moyou from Nvidia called this "the most important slide" in his talk on LLM inference. Real production traffic has four distinct patterns:

| Pattern | ISL (Prompt) | OSL (Output) | Bottleneck |
|---|---|---|---|
| **Summarization** | 2000-4000 | 100-300 | TTFT (prefill-dominated) |
| **Chatbot** | 50-200 | 100-500 | Balanced |
| **Code Generation** | 200-500 | 500-2000 | KV cache memory (decode-dominated) |
| **Mixed Production** | All of the above | All of the above | Everything at once |

Static batching gets destroyed by mixed workloads: a code-generation request (800 tokens) blocks ten chatbot requests (10 tokens each) for hundreds of wasted steps. Continuous batching handles it gracefully -- short requests leave as soon as they're done.

Why this matters for cost: In mixed production traffic, continuous batching delivers 2-4x throughput improvement over static batching. Combined with PagedAttention's 6.5x memory efficiency, vLLM can serve an order of magnitude more users on the same hardware compared to naive approaches.

7. Putting It All Together

Every component connects into one coherent system:

[Diagram: the complete vLLM architecture -- Phase 1: tokenization & embedding (BPE, 50,257 vocab, 768 dims); Phase 2: attention + KV cache (12 layers × 12 heads, TTFT/ITL); Phase 3: PagedAttention block manager (16-token blocks, <1% waste, 6.5x more users); Phase 4: scheduler & continuous batching (WAITING → RUNNING → FINISHED, no head-of-line blocking) -- ending in a streaming response]

Here's what happens when you send a prompt to vLLM:

  1. Tokenizer breaks your text into token IDs
  2. Embedding looks up each token's 768-dim vector + position vector
  3. The scheduler admits your request if GPU memory allows (WAITING → RUNNING)
  4. PagedAttention's block manager allocates KV cache blocks for your prompt
  5. Prefill: All 12 attention layers process your full prompt at once (compute TTFT)
  6. Decode loop: Each step generates one token. Attention reads from cached K,V. New K,V appended to blocks. Block manager allocates new blocks as needed.
  7. When your request finishes, scheduler frees all KV blocks → other requests can use them
  8. Detokenizer converts token IDs back to text → streamed to you

What I Built vs What vLLM Does

Every component I built from scratch maps directly to real vLLM source code:

| My From-Scratch File | Real vLLM | What It Does |
|---|---|---|
| `tokenization_lab.py` | HuggingFace tokenizers | Text ↔ token IDs |
| `embedding_explorer.py` | Model embedding layers | Token IDs → vectors |
| `attention.py` | `vllm/attention/backends/` | Q/K/V computation, multi-head |
| `kv_cache.py` | KV cache infrastructure | Store and read K,V tensors |
| `block_manager.py` | `vllm/core/block_manager.py` | Allocate/free KV blocks |
| `paged_attention.py` | PagedAttention CUDA kernel | Attention over scattered blocks |
| `scheduler.py` | `vllm/core/scheduler.py` | 3-queue scheduling, preemption |
| `continuous_batching.py` | vLLM engine loop | In-flight batching |
| `workload_simulator.py` | Production monitoring | ISL/OSL traffic analysis |

Real Benchmarks (TinyLlama-1.1B on T4 GPU)

Running actual vLLM on a Google Colab T4 confirmed every prediction:

TTFT scales linearly with prompt length (prefill is compute-bound):


ISL=10: ~25ms | ISL=100: ~30ms | ISL=500: ~55ms | ISL=1000: ~95ms

Throughput scales with batch size (PagedAttention enables large batches):


Batch=1: ~180 tok/s | Batch=8: ~800 tok/s | Batch=32: ~2400 tok/s | Batch=64: ~2800 tok/s

15x throughput improvement from batch=1 to batch=64. This is only possible because PagedAttention doesn't waste memory on over-allocated KV caches.

ISL/OSL 2D histogram reveals traffic shape -- the visualization from the Nvidia talk. Three distinct clusters emerge: summarization (high ISL, low OSL), chatbot (low ISL, medium OSL), and code generation (medium ISL, high OSL). This is the #1 diagnostic tool for LLM serving teams.


8. Key Takeaways

  1. Tokens, not words. LLMs process subword pieces (~4 characters each). Code uses ~2x more tokens than English, effectively halving your context window.
  2. Embeddings give tokens meaning. The input to the transformer is token_embedding + position_embedding -- a matrix of 768-dimensional vectors on the GPU.
  3. Attention = weighted averaging with a twist. Q matches against K, V provides the information. The causal mask enforces one-token-at-a-time generation. Multi-head attention learns 12 different relationship types simultaneously.
  4. KV cache is the #1 optimization. Without it, generation is O(n^2). With it, O(n). But it trades compute for GPU memory. TTFT (prefill) grows with prompt length; ITL (decode) stays constant.
  5. PagedAttention reduces memory waste from 81% to <1%. By allocating KV cache in 16-token blocks instead of contiguous max_seq_len chunks, vLLM serves 6.5x more concurrent users on the same GPU.
  6. Continuous batching eliminates head-of-line blocking. Finished requests leave immediately, new ones join. No wasted GPU cycles. 2-4x throughput improvement.
  7. GPU Memory = Model + KV Cache. Quantize the model (FP16→FP8→INT4) to free memory for more KV cache = more users = lower cost.


I built every component from scratch to understand it, then validated on real hardware. The Nvidia talk gave me the "why." Building it gave me the "how." Real benchmarks gave me the "proof."

If you want to truly understand LLM inference -- don't just read about it. Build it.