This is part of my vLLM learning series. In this session, I cover Step 1 (The User API).
Note: This content was generated by Claude, grounded on the actual
vLLM codebase. It is intended for personal
learning only and may contain inaccuracies. Always verify against the
original source code and official documentation.
Topic: vLLM
Date: 2026-01-31
Sections covered: Step 1 (The User API)
Prerequisites: None
Today's Material
1. What is vLLM and Why Does It Matter?
LLM inference is GPU-memory-bound. When a model generates text, it needs to store key-value (KV) caches (intermediate computations from the attention mechanism) for every token in every active request. Naive implementations pre-allocate the maximum possible sequence length for each request, wasting 60-80% of GPU memory on empty space.
vLLM solves this with PagedAttention: instead of pre-allocating a giant contiguous buffer per request, it carves GPU memory into fixed-size blocks (default 16 tokens each) and allocates them on demand, just like an operating system manages virtual memory with pages.
The result: near-optimal memory utilization and 2-4x higher throughput than HuggingFace Transformers on typical workloads.
📝 Note:
Think of the difference like this: the naive approach is like reserving an entire row of seats in a theater for each person "just in case" they bring friends. PagedAttention is like assigning individual seats as people actually show up.
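To make the block math concrete, here is a small illustrative sketch. This is not vLLM code; the 16-token block size matches vLLM's default, but the 2048-token preallocation figure is just an example number for the naive approach.
import math

BLOCK_SIZE = 16          # tokens per KV cache block (vLLM's default)
MAX_SEQ_LEN = 2048       # what a naive implementation might preallocate

def naive_slots(num_tokens: int) -> int:
    # Naive: reserve the maximum sequence length up front.
    return MAX_SEQ_LEN

def paged_slots(num_tokens: int) -> int:
    # Paged: allocate only as many fixed-size blocks as the tokens need.
    return math.ceil(num_tokens / BLOCK_SIZE) * BLOCK_SIZE

for n in (10, 50, 300):
    naive, paged = naive_slots(n), paged_slots(n)
    print(f"{n:4d} tokens: naive reserves {naive} slots, "
          f"paged reserves {paged} slots "
          f"({100 * (1 - paged / naive):.0f}% fewer)")
The only waste left in the paged case is the unused tail of the last block, at most 15 tokens per sequence.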
2. High-Level Architecture
Before diving into code, here's the bird's-eye view of how vLLM is organized:
┌────────────────────────────────────────────────────────┐
│                   User-Facing Layer                    │
│          LLM class | OpenAI API Server | gRPC          │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                      Engine Layer                      │
│  InputProcessor → EngineCoreClient → OutputProcessor   │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                       Engine Core                      │
│          Scheduler → Executor → Workers → GPU          │
│             └── KVCacheManager (BlockPool)             │
└────────────────────────────────────────────────────────┘
Three layers:
- User-Facing: multiple entry points (Python API, HTTP, gRPC) that all funnel into the engine
- Engine Layer: tokenizes inputs, relays them to the core, formats outputs
- Engine Core: the scheduling loop, KV cache management, and GPU execution
Today we focus on layer 1: the LLM class and its associated types.
3. The LLM Class: Your Main Interface
The LLM class in vllm/entrypoints/llm.py is the primary interface for offline batch inference. Here's its constructor (simplified to the most important parameters):
# vllm/entrypoints/llm.py
class LLM:
    def __init__(
        self,
        model: str,
        *,
        tokenizer: str | None = None,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        gpu_memory_utilization: float = 0.9,
        seed: int = 0,
        **kwargs: Any,
    ) -> None:
        engine_args = EngineArgs(
            model=model,
            tokenizer=tokenizer,
            tensor_parallel_size=tensor_parallel_size,
            dtype=dtype,
            quantization=quantization,
            gpu_memory_utilization=gpu_memory_utilization,
            **kwargs,
        )
        self.llm_engine = LLMEngine.from_engine_args(
            engine_args=engine_args,
            usage_context=UsageContext.LLM_CLASS,
        )
        self.request_counter = Counter()
Key things to notice:
- LLM is thin: it creates an EngineArgs config, then hands everything off to LLMEngine.from_engine_args()
- gpu_memory_utilization=0.9 means vLLM may use up to 90% of GPU memory for model weights and the KV cache, leaving the remaining 10% as headroom for PyTorch overhead (activations, CUDA context, etc.)
- tensor_parallel_size controls how many GPUs to shard the model across; set it to 1 for single-GPU
💡 Tip:
If you get CUDA out-of-memory errors, lower gpu_memory_utilization (e.g., to 0.8). If you want more throughput and have headroom, raise it (up to ~0.95).
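For example, a constructor call that applies this tip might look like the following sketch. The model name and values are just for illustration; only the parameters shown in the constructor above are used.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.8,   # lower this if you hit CUDA OOM
    tensor_parallel_size=1,       # single GPU
    dtype="auto",
)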
4. The generate() Method: Where Requests Enter
# vllm/entrypoints/llm.py
def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
) -> list[RequestOutput]:
    model_config = self.model_config
    runner_type = model_config.runner_type
    if runner_type != "generate":
        raise ValueError(
            "LLM.generate() is only supported for generative models.")

    if sampling_params is None:
        sampling_params = self.get_default_sampling_params()

    self._validate_and_add_requests(
        prompts=prompts,
        params=sampling_params,
        use_tqdm=use_tqdm,
        lora_request=...,
        priority=priority,
    )

    outputs = self._run_engine(use_tqdm=use_tqdm)
    return self.engine_class.validate_outputs(outputs, RequestOutput)
The flow is:
- Validate that this is a generative model (not an embedding model)
- Add requests via _validate_and_add_requests(): normalizes inputs, pairs each prompt with its SamplingParams, and sends them to the engine
- Run the engine via _run_engine(): loops until all requests are finished
- Return sorted RequestOutput objects
You can pass a single SamplingParams (applied to all prompts) or a list (one per prompt). This is useful when different prompts need different temperatures or stop conditions.
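As a sketch of the per-prompt form (the prompt texts and parameter values here are made up for illustration):
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Summarize this paragraph: ...", "Write a haiku about rain."]
per_prompt_params = [
    SamplingParams(temperature=0.2, max_tokens=128, stop=["\n\n"]),
    SamplingParams(temperature=0.9, max_tokens=40),
]

# One SamplingParams per prompt; the two lists must be the same length.
outputs = llm.generate(prompts, per_prompt_params)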
5. _run_engine(): The Processing Loop
This is where the actual inference happens:
# vllm/entrypoints/llm.py
def _run_engine(
    self, *, use_tqdm: bool | Callable[..., tqdm] = True
) -> list[RequestOutput | PoolingRequestOutput]:
    outputs: list[RequestOutput | PoolingRequestOutput] = []
    total_in_toks = 0
    total_out_toks = 0

    while self.llm_engine.has_unfinished_requests():
        step_outputs = self.llm_engine.step()
        for output in step_outputs:
            if output.finished:
                outputs.append(output)

    # Sort the outputs by request ID. This is necessary because some
    # requests may finish earlier than requests submitted before them.
    return sorted(outputs, key=lambda x: int(x.request_id))
The key insight: _run_engine() is a simple loop. It calls self.llm_engine.step() repeatedly. Each step() runs one iteration of the scheduling + inference pipeline, potentially processing hundreds of requests in a single forward pass. Finished requests come back as RequestOutput objects.
⚠️ Warning:
The outputs are sorted by request_id at the end because requests don't finish in order. A short request (e.g., "Say hi") may finish in 5 iterations while a long request (e.g., "Write an essay") takes 500. The sorting ensures the output list matches the input prompt order.
Why this matters: This loop is where continuous batching happens. Unlike static batching (process N prompts, wait for all to finish, return), vLLM processes requests at different stages simultaneously. Request A might be mid-generation while Request B is just starting its prefill.
6. SamplingParams: Controlling Generation
Every request carries a SamplingParams that controls how tokens are selected:
# vllm/sampling_params.py
class SamplingParams(
    PydanticMsgspecMixin,
    msgspec.Struct,
    omit_defaults=True,
    dict=True,
):
    # --- Core sampling ---
    n: int = 1                        # Number of output sequences
    temperature: float = 1.0          # 0 = greedy, higher = more random
    top_p: float = 1.0                # Nucleus sampling threshold
    top_k: int = 0                    # Top-K filtering (0 = disabled)
    min_p: float = 0.0                # Minimum probability threshold
    seed: int | None = None           # Reproducible sampling

    # --- Penalties ---
    presence_penalty: float = 0.0     # Penalize tokens that appeared
    frequency_penalty: float = 0.0    # Penalize by frequency
    repetition_penalty: float = 1.0   # Multiplicative penalty

    # --- Generation limits ---
    max_tokens: int | None = 16       # Output length limit
    min_tokens: int = 0               # Minimum before allowing EOS
    ignore_eos: bool = False          # Don't stop at EOS

    # --- Stop conditions ---
    stop: str | list[str] | None = None
    stop_token_ids: list[int] | None = None

    # --- Output control ---
    logprobs: int | None = None          # Return top-N log probabilities
    prompt_logprobs: int | None = None   # Prompt token log probs
    detokenize: bool = True              # Decode token IDs to text

    # --- Advanced ---
    structured_outputs: StructuredOutputsParams | None = None  # JSON schema
    logit_bias: dict[int, float] | None = None
    output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE
Notice that SamplingParams inherits from msgspec.Struct, not a Python dataclass. This is a deliberate performance choice β msgspec serialization is 10-50x faster than pickle, which matters when requests cross process boundaries (more on this in a future session).
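To see what msgspec buys you, here is a minimal standalone sketch (not vLLM code) that defines a toy msgspec.Struct and round-trips it through MessagePack, roughly what happens to SamplingParams when it crosses a process boundary:
import msgspec

class MiniParams(msgspec.Struct, omit_defaults=True):
    # A toy stand-in for SamplingParams, just for illustration.
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int | None = 16

params = MiniParams(temperature=0.7, max_tokens=64)
payload = msgspec.msgpack.encode(params)                # compact bytes
restored = msgspec.msgpack.decode(payload, type=MiniParams)
assert restored == params
print(len(payload), restored)
Because omit_defaults=True, fields left at their defaults don't appear in the payload at all, which keeps the per-request message small.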
Validation logic
SamplingParams.__post_init__() enforces constraints:
# vllm/sampling_params.py
def __post_init__(self) -> None:
    # Normalize stop to a list
    if self.stop is None:
        self.stop = []
    elif isinstance(self.stop, str):
        self.stop = [self.stop]

    # Zero temperature -> force greedy sampling
    if self.temperature < _SAMPLING_EPS:
        self.top_p = 1.0
        self.top_k = 0
        self.min_p = 0.0
        self._verify_greedy_sampling()

    self._verify_args()

def _verify_args(self) -> None:
    if self.n < 1:
        raise ValueError(f"n must be at least 1, got {self.n}.")
    if not -2.0 <= self.presence_penalty <= 2.0:
        raise ValueError(...)
    if self.temperature < 0.0:
        raise VLLMValidationError(...)
    if not 0.0 < self.top_p <= 1.0:
        raise VLLMValidationError(...)
    if self.max_tokens is not None and self.max_tokens < 1:
        raise VLLMValidationError(...)
📝 Note:
When temperature=0, vLLM automatically sets top_p=1.0, top_k=0, and min_p=0.0. This is because greedy decoding (always pick the highest-probability token) makes all other sampling parameters irrelevant. The code enforces this rather than letting the user set contradictory values.
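A quick way to observe this normalization (a sketch assuming a working vLLM install; the printed values are what the code above should produce):
from vllm import SamplingParams

p = SamplingParams(temperature=0.0, top_k=50)
# __post_init__ detects greedy mode and overrides the sampling knobs.
print(p.temperature, p.top_k, p.top_p, p.min_p)  # expected: 0.0 0 1.0 0.0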
7. RequestOutput and CompletionOutput: What You Get Back
After generate() finishes, you get a list of RequestOutput objects:
# vllm/outputs.py
class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        num_cached_tokens: int | None = None,
        ...
    ): ...
Each RequestOutput contains one or more CompletionOutput objects (one per n in SamplingParams):
# vllm/outputs.py
@dataclass
class CompletionOutput:
    index: int                          # Which of the n outputs
    text: str                           # Generated text
    token_ids: GenericSequence[int]     # Generated token IDs
    cumulative_logprob: float | None    # Sum of log probs
    logprobs: SampleLogprobs | None     # Per-token log probs
    finish_reason: str | None           # "stop", "length", or None
    stop_reason: int | str | None       # What triggered the stop

    def finished(self) -> bool:
        return self.finish_reason is not None
A typical usage pattern:
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(
    ["What is the capital of France?", "Explain gravity in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64)
)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    reason = output.outputs[0].finish_reason  # "stop" or "length"
    print(f"Prompt: {prompt}")
    print(f"Output: {generated}")
    print(f"Finished because: {reason}\n")
When n > 1, you get multiple completions per prompt:
outputs = llm.generate(
    ["Tell me a joke."],
    SamplingParams(n=3, temperature=0.9, max_tokens=100)
)

# outputs[0].outputs has 3 CompletionOutput objects
for completion in outputs[0].outputs:
    print(f"Completion {completion.index}: {completion.text}")
8. Beyond generate(): Other Task Types
The LLM class supports more than text generation:
# Chat (applies chat template automatically)
outputs = llm.chat(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    sampling_params=SamplingParams(max_tokens=32)
)
# Embeddings
outputs = llm.embed(["Hello world", "Goodbye world"])
# Classification (not all models support this)
outputs = llm.classify(["This movie was great!", "Terrible film."])
# Scoring (cross-encoder style)
outputs = llm.score("query text", ["doc1", "doc2", "doc3"])
Each method validates that the loaded model supports the requested task via runner_type. If you try llm.generate() on an embedding model, you get a clear error.
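As a sketch of what that check looks like from the user's side: the error comes from the runner_type branch we saw in generate(). The embedding model name below is a hypothetical example, and depending on your vLLM version you may need to tell the constructor explicitly that the model should be loaded for embedding/pooling rather than relying on auto-detection.
from vllm import LLM, SamplingParams

# Hypothetical example: a model loaded for embedding rather than generation.
embed_llm = LLM(model="BAAI/bge-base-en-v1.5")

try:
    embed_llm.generate(["Hello"], SamplingParams(max_tokens=8))
except ValueError as err:
    # e.g. "LLM.generate() is only supported for generative models."
    print(f"Rejected as expected: {err}")

# The matching task method works fine.
vectors = embed_llm.embed(["Hello"])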
Exercises
Exercise 1: Basic Generation
Difficulty: Beginner
Goal: Understand the relationship between SamplingParams and output
Given this code:
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0, max_tokens=5, n=2)
outputs = llm.generate(["Count from 1 to 10."], params)
- What happens when temperature=0 and n=2? Will the two completions be different or identical? Why?
- What will finish_reason be for each completion? ("stop" or "length")
- How many CompletionOutput objects will be in outputs[0].outputs?
Hint: Think about what greedy decoding means for multiple samples.
Solution
- Identical. temperature=0 means greedy decoding: always pick the highest-probability token. With no randomness, every sample produces the exact same sequence. Running n=2 with greedy is wasteful.
- "length" for both. max_tokens=5 will cut off "Count from 1 to 10" well before the model naturally stops; it would need at least ~20 tokens ("1, 2, 3, 4, 5, 6, 7, 8, 9, 10").
- 2, one per n: outputs[0].outputs[0] and outputs[0].outputs[1], though both will have the same text.
Exercise 2: Trace the Call Path
Difficulty: Intermediate
Goal: Map the execution flow from user call to engine loop
Trace what happens when this code executes:
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Hello", "World"],
    SamplingParams(max_tokens=10)
)
For each step, name the method and describe what it does:
- What does generate() call first?
- How are the two prompts and the single SamplingParams paired?
- What does _run_engine() do on each iteration?
- Why are outputs sorted at the end?
Solution
- generate() first validates that the runner type is "generate", then calls _validate_and_add_requests().
- The single SamplingParams is replicated: [params] * num_requests, so both "Hello" and "World" get the same max_tokens=10.
- Each iteration calls self.llm_engine.step(), which runs one scheduling + inference cycle. Finished requests are collected into the outputs list.
- Because requests finish out of order. "Hello" (shorter) might finish before "World" (or vice versa, depending on generation). Sorting by request_id ensures outputs[0] corresponds to "Hello" and outputs[1] to "World".
Exercise 3: SamplingParams Edge Cases
Difficulty: Intermediate
Goal: Understand validation and normalization
What happens in each case? Does it succeed, raise an error, or get silently normalized?
- SamplingParams(temperature=-0.5)
- SamplingParams(temperature=0, top_k=50)
- SamplingParams(top_p=0.0)
- SamplingParams(stop="END", include_stop_str_in_output=False): what does output_text_buffer_length get set to?
- SamplingParams(max_tokens=0)
Solution
- Raises VLLMValidationError: _verify_args() checks self.temperature < 0.0.
- Silently normalized. When temperature=0, __post_init__ forces top_k=0 (along with top_p=1.0 and min_p=0.0). Your top_k=50 is overwritten.
- Raises VLLMValidationError: _verify_args() checks not 0.0 < self.top_p <= 1.0. Zero is not in the valid range.
- output_text_buffer_length is set to len("END") - 1 = 2. This buffer ensures the output processor doesn't emit text that might be part of the stop string before the full match is determined.
- Raises VLLMValidationError: _verify_args() checks self.max_tokens < 1 when not None.
A quick verification sketch for the constructor cases is shown below.
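This sketch assumes a working vLLM install and catches Exception broadly, since the exact error classes (ValueError vs. VLLMValidationError) may differ between versions:
from vllm import SamplingParams

cases = {
    "temperature=-0.5": dict(temperature=-0.5),
    "temperature=0, top_k=50": dict(temperature=0, top_k=50),
    "top_p=0.0": dict(top_p=0.0),
    "max_tokens=0": dict(max_tokens=0),
}

for label, kwargs in cases.items():
    try:
        params = SamplingParams(**kwargs)
        print(f"{label}: accepted (top_k={params.top_k}, top_p={params.top_p})")
    except Exception as err:
        print(f"{label}: rejected -> {type(err).__name__}")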
Exercise 4: Design a Batch Inference Script
Difficulty: Advanced
Goal: Apply what you've learned to a realistic scenario
You have a file with 10,000 prompts (one per line). You need to generate completions with temperature=0.8 and max_tokens=256, saving results to a JSON file. Design the script:
- Should you call generate() once with all 10,000 prompts, or in batches of 100? Why?
- How would you handle prompts that need different max_tokens?
- If 3 out of 10,000 prompts fail, how would you know which ones? (Hint: look at request_id)
Solution
- Call generate() once with all 10,000. vLLM's continuous batching handles scheduling internally: it dynamically fits as many requests as GPU memory allows per step. Breaking the work into batches of 100 would serialize it unnecessarily and prevent vLLM from fully utilizing the GPU.
- Pass a list of SamplingParams, one per prompt: [SamplingParams(max_tokens=t) for t in per_prompt_max_tokens]. This lets each prompt have its own configuration.
- Match by index. Since generate() sorts outputs by request_id (which maps to the input order), outputs[i] corresponds to prompts[i]. Check outputs[i].outputs[0].finish_reason: if it's None or shows an unexpected state, that prompt had issues. You could also compare len(outputs) with len(prompts) to see if any were dropped.
A sketch of one possible script is shown below.
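This is just one possible shape for the script; the file names and JSON layout are arbitrary choices.
import json
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

params = SamplingParams(temperature=0.8, max_tokens=256)

# One generate() call; continuous batching schedules all 10,000 internally.
outputs = llm.generate(prompts, params)

results = [
    {
        "prompt": out.prompt,
        "completion": out.outputs[0].text,
        "finish_reason": out.outputs[0].finish_reason,
    }
    for out in outputs
]

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)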
Quiz
Answer these questions based on today's material. Try to answer each question before revealing the answer.
Q1: What does LLM.__init__() actually do with its parameters? Where does the heavy lifting happen?
Answer
It packs parameters into EngineArgs and calls LLMEngine.from_engine_args(). The LLM class itself does minimal work β it's a convenience wrapper. The engine factory method parses the args into a VllmConfig, selects the right executor, loads the model, allocates the KV cache, and initializes the scheduler.
Q2: Why does _run_engine() sort its outputs by request_id before returning?
Answer
Because requests don't finish in order. Short prompts complete in fewer iterations than long ones. Since _run_engine() collects outputs as they finish, a request with request_id=5 might finish before request_id=3. Sorting by request ID restores the original prompt order so outputs[i] corresponds to prompts[i].
Q3: What is the default value of max_tokens in SamplingParams, and why might this surprise users?
Answer
The default is 16 tokens. This is much smaller than most users expect, and far below the limits hosted APIs typically allow. If your outputs seem cut short, you probably need to set max_tokens explicitly. The low default is intentional: it prevents accidental resource exhaustion when experimenting.
Q4: What happens internally to SamplingParams when you set temperature=0?
Answer
vLLM forces greedy sampling parameters. Specifically, it sets top_p=1.0, top_k=0, and min_p=0.0, then calls _verify_greedy_sampling(). This is because when temperature is zero (always pick the highest-probability token), top-p/top-k filtering is meaningless and could introduce unexpected behavior.
Q5: Why does SamplingParams inherit from msgspec.Struct instead of using a Python @dataclass?
Answer
Serialization performance. msgspec.Struct provides 10-50x faster serialization than pickle (used with dataclasses). This matters because when vLLM runs in multiprocess mode, SamplingParams is serialized with msgspec.msgpack and sent over ZMQ sockets from the frontend process to the engine core process. Faster serialization = lower per-request overhead.
Q6: What is the difference between finish_reason="stop" and finish_reason="length" in CompletionOutput?
Answer
"stop" means the model naturally stopped β it hit an EOS token, a stop string, or a stop token ID. "length" means it hit max_tokens β the model wanted to keep generating but was cut off. If you see many "length" finishes, consider increasing max_tokens.
Q7: True or false: Calling llm.generate() with a single SamplingParams and a list of 100 prompts will use the same sampling parameters for all 100 prompts.
Answer
True. When you pass a single SamplingParams (not a list), _validate_and_add_requests() replicates it: engine_params = [params] * num_requests. Each prompt gets the same sampling configuration. To use different parameters per prompt, pass a list of SamplingParams with the same length as the prompts list.
Q8: What does gpu_memory_utilization=0.9 mean, and what happens to the other 10%?
Answer
vLLM uses 90% of GPU memory for the KV cache (and model weights). The remaining 10% is reserved for PyTorch's internal allocations β temporary activation tensors, CUDA context, cuBLAS workspace, etc. If you set it too high (e.g., 0.99), you risk CUDA OOM during forward passes. If you set it too low, you waste GPU capacity.
Summary
- vLLM solves the KV cache memory waste problem via PagedAttention: block-based allocation instead of pre-allocation
- The LLM class is a thin wrapper: it creates EngineArgs, builds LLMEngine, and provides generate(), chat(), embed(), and other task methods
- generate() validates inputs, adds requests to the engine, then loops step() until all requests finish
- _run_engine() is a simple while-loop over llm_engine.step(); this is where continuous batching happens under the hood
- SamplingParams controls per-request generation with thorough validation: zero temperature forces greedy mode, invalid ranges raise errors
- RequestOutput wraps one or more CompletionOutput objects, each containing the generated text, token IDs, and finish reason
- Next session: the Engine Layer: what LLMEngine does inside step(), how InputProcessor tokenizes prompts, and how EngineCoreClient bridges to the core
Generated from my ai-study learning project.