Session 1: vLLM Overview and the User API

This is part of my vLLM learning series. In this session, I cover Step 1 (The User API).

Note: This content was generated by Claude, grounded on the actual
vLLM codebase. It is intended for personal
learning only and may contain inaccuracies. Always verify against the
original source code and official documentation.

Topic: vLLM
Date: 2026-01-31
Sections covered: Step 1 (The User API)
Prerequisites: None


Today's Material

1. What is vLLM and Why Does It Matter?

LLM inference is GPU-memory-bound. When a model generates text, it needs to store key-value (KV) caches - intermediate computations from the attention mechanism - for every token in every active request. Naive implementations pre-allocate the maximum possible sequence length for each request, wasting 60-80% of GPU memory on empty space.

vLLM solves this with PagedAttention: instead of pre-allocating a giant contiguous buffer per request, it carves GPU memory into fixed-size blocks (default 16 tokens each) and allocates them on demand - just like how an operating system manages virtual memory with pages.

The result: near-optimal memory utilization and 2-4x higher throughput than HuggingFace Transformers on typical workloads.

πŸ“ Note:
Think of the difference like this: the naive approach is like reserving an entire row of seats in a theater for each person "just in case" they bring friends. PagedAttention is like assigning individual seats as people actually show up.
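To make the block idea concrete, here is a tiny back-of-the-envelope sketch - illustrative Python, not vLLM code. The block size of 16 matches vLLM's default; the 4096-token reservation for the naive case is an assumed example.

# Toy comparison of pre-allocated vs. paged KV cache usage (not vLLM code).
import math

BLOCK_SIZE = 16        # tokens per KV cache block (vLLM's default)
MAX_SEQ_LEN = 4096     # what a naive implementation reserves per request (assumed)

def blocks_needed(num_tokens: int) -> int:
    # Paged allocation only grabs whole blocks the sequence actually touches.
    return math.ceil(num_tokens / BLOCK_SIZE)

for actual_len in (50, 300, 1500):
    paged_tokens = blocks_needed(actual_len) * BLOCK_SIZE
    naive_waste = 1 - actual_len / MAX_SEQ_LEN
    paged_waste = 1 - actual_len / paged_tokens
    print(f"seq={actual_len:5d}  naive waste={naive_waste:6.1%}  paged waste={paged_waste:6.1%}")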

2. High-Level Architecture

Before diving into code, here's the bird's-eye view of how vLLM is organized:

┌─────────────────────────────────────────────────────┐
│                  User-Facing Layer                  │
│      LLM class  |  OpenAI API Server  |  gRPC       │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│                    Engine Layer                     │
│ InputProcessor → EngineCoreClient → OutputProcessor │
└──────────────────────────┬──────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────┐
│                     Engine Core                     │
│        Scheduler → Executor → Workers → GPU         │
│           └── KVCacheManager (BlockPool)            │
└─────────────────────────────────────────────────────┘

Three layers:

  1. User-Facing - multiple entry points (Python API, HTTP, gRPC) that all funnel into the engine
  2. Engine Layer - tokenizes inputs, relays them to the core, formats outputs
  3. Engine Core - the scheduling loop, KV cache management, and GPU execution

Today we focus on layer 1: the LLM class and its associated types.

3. The LLM Class - Your Main Interface

The LLM class in vllm/entrypoints/llm.py is the primary interface for offline batch inference. Here's its constructor (simplified to the most important parameters):

# vllm/entrypoints/llm.py
class LLM:
    def __init__(
        self,
        model: str,
        *,
        tokenizer: str | None = None,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        gpu_memory_utilization: float = 0.9,
        seed: int = 0,
        **kwargs: Any,
    ) -> None:
        engine_args = EngineArgs(
            model=model,
            tokenizer=tokenizer,
            tensor_parallel_size=tensor_parallel_size,
            dtype=dtype,
            quantization=quantization,
            gpu_memory_utilization=gpu_memory_utilization,
            **kwargs,
        )

        self.llm_engine = LLMEngine.from_engine_args(
            engine_args=engine_args,
            usage_context=UsageContext.LLM_CLASS,
        )
        self.request_counter = Counter()

Key things to notice:

  • LLM is thin β€” it creates an EngineArgs config, then hands everything off to LLMEngine.from_engine_args()
  • gpu_memory_utilization=0.9 means vLLM claims 90% of GPU memory for the KV cache, reserving 10% for PyTorch overhead (model weights, activations, etc.)
  • tensor_parallel_size controls how many GPUs to shard the model across β€” set to 1 for single-GPU

💡 Tip:
If you get CUDA out-of-memory errors, lower gpu_memory_utilization (e.g., to 0.8). If you want more throughput and have headroom, raise it (up to ~0.95).
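Putting the constructor parameters together, a typical single-GPU setup might look like this (the model name and the 0.85 value are just example choices):

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,        # single GPU, no sharding
    dtype="auto",                  # infer dtype from the model config
    gpu_memory_utilization=0.85,   # slightly more headroom than the 0.9 default
)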

4. The generate() Method - Where Requests Enter

# vllm/entrypoints/llm.py
def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
) -> list[RequestOutput]:
    model_config = self.model_config
    runner_type = model_config.runner_type
    if runner_type != "generate":
        raise ValueError(
            "LLM.generate() is only supported for generative models.")

    if sampling_params is None:
        sampling_params = self.get_default_sampling_params()

    self._validate_and_add_requests(
        prompts=prompts,
        params=sampling_params,
        use_tqdm=use_tqdm,
        lora_request=lora_request,
        priority=priority,
    )

    outputs = self._run_engine(use_tqdm=use_tqdm)
    return self.engine_class.validate_outputs(outputs, RequestOutput)

The flow is:

  1. Validate that this is a generative model (not an embedding model)
  2. Add requests via _validate_and_add_requests() - normalizes inputs, pairs each prompt with its SamplingParams, and sends them to the engine
  3. Run the engine via _run_engine() - loops until all requests are finished
  4. Return sorted RequestOutput objects

You can pass a single SamplingParams (applied to all prompts) or a list (one per prompt). This is useful when different prompts need different temperatures or stop conditions.
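For example, a per-prompt list might pair a creative prompt with a deterministic one (the prompts and values below are arbitrary; llm is the instance created earlier):

from vllm import SamplingParams

prompts = [
    "Write a haiku about the ocean.",   # wants some randomness
    "What is 17 * 24?",                 # wants a short, deterministic answer
]
params = [
    SamplingParams(temperature=0.9, max_tokens=64),
    SamplingParams(temperature=0.0, max_tokens=16),
]

# One SamplingParams per prompt, matched by position.
outputs = llm.generate(prompts, params)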

5. _run_engine() - The Processing Loop

This is where the actual inference happens:

# vllm/entrypoints/llm.py
def _run_engine(
    self, *, use_tqdm: bool | Callable[..., tqdm] = True
) -> list[RequestOutput | PoolingRequestOutput]:
    outputs: list[RequestOutput | PoolingRequestOutput] = []
    total_in_toks = 0    # running token counts used for the tqdm progress bar (not shown here)
    total_out_toks = 0

    while self.llm_engine.has_unfinished_requests():
        step_outputs = self.llm_engine.step()
        for output in step_outputs:
            if output.finished:
                outputs.append(output)

    # Sort the outputs by request ID. This is necessary because some
    # requests may finish earlier than requests submitted before them.
    return sorted(outputs, key=lambda x: int(x.request_id))

The key insight: _run_engine() is a simple loop. It calls self.llm_engine.step() repeatedly. Each step() runs one iteration of the scheduling + inference pipeline - potentially processing hundreds of requests in a single forward pass. Finished requests come back as RequestOutput objects.

⚠️ Warning:
The outputs are sorted by request_id at the end because requests don't finish in order. A short request (e.g., "Say hi") may finish in 5 iterations while a long request (e.g., "Write an essay") takes 500. The sorting ensures the output list matches the input prompt order.

Why this matters: This loop is where continuous batching happens. Unlike static batching (process N prompts, wait for all to finish, return), vLLM processes requests at different stages simultaneously. Request A might be mid-generation while Request B is just starting its prefill.
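As a mental model only - this is a toy sketch of the idea, not vLLM's actual scheduler - continuous batching means the running set changes every iteration: finished requests leave immediately and waiting requests join as soon as there is room.

# Toy continuous-batching loop (illustration only; vLLM's scheduler is far richer).
from collections import deque

waiting = deque([("A", 3), ("B", 1), ("C", 5)])  # (request_id, tokens left to generate)
running: dict[str, int] = {}
MAX_RUNNING = 2                                  # stand-in for the GPU memory budget

step = 0
while waiting or running:
    # Admit new requests whenever there is capacity; no waiting for the batch to drain.
    while waiting and len(running) < MAX_RUNNING:
        req_id, budget = waiting.popleft()
        running[req_id] = budget
    # One "forward pass": every running request advances by one token.
    for req_id in list(running):
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]                  # finished requests free their slot at once
    step += 1
    print(f"step {step}: running={sorted(running)}")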

6. SamplingParams - Controlling Generation

Every request carries a SamplingParams that controls how tokens are selected:

# vllm/sampling_params.py
class SamplingParams(
    PydanticMsgspecMixin,
    msgspec.Struct,
    omit_defaults=True,
    dict=True,
):
    # --- Core sampling ---
    n: int = 1                          # Number of output sequences
    temperature: float = 1.0            # 0 = greedy, higher = more random
    top_p: float = 1.0                  # Nucleus sampling threshold
    top_k: int = 0                      # Top-K filtering (0 = disabled)
    min_p: float = 0.0                  # Minimum probability threshold
    seed: int | None = None             # Reproducible sampling

    # --- Penalties ---
    presence_penalty: float = 0.0       # Penalize tokens that appeared
    frequency_penalty: float = 0.0      # Penalize by frequency
    repetition_penalty: float = 1.0     # Multiplicative penalty

    # --- Generation limits ---
    max_tokens: int | None = 16         # Output length limit
    min_tokens: int = 0                 # Minimum before allowing EOS
    ignore_eos: bool = False            # Don't stop at EOS

    # --- Stop conditions ---
    stop: str | list[str] | None = None
    stop_token_ids: list[int] | None = None

    # --- Output control ---
    logprobs: int | None = None         # Return top-N log probabilities
    prompt_logprobs: int | None = None  # Prompt token log probs
    detokenize: bool = True             # Decode token IDs to text

    # --- Advanced ---
    structured_outputs: StructuredOutputsParams | None = None  # JSON schema
    logit_bias: dict[int, float] | None = None
    output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE

Notice that SamplingParams inherits from msgspec.Struct, not a Python dataclass. This is a deliberate performance choice - msgspec serialization is 10-50x faster than pickle, which matters when requests cross process boundaries (more on this in a future session).
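If msgspec is new to you, here is a minimal round-trip with a standalone struct (not vLLM's actual types) showing the encode/decode pattern used for cross-process messages:

import msgspec

class ToyParams(msgspec.Struct, omit_defaults=True):
    temperature: float = 1.0
    max_tokens: int | None = 16

params = ToyParams(temperature=0.2)

# Compact binary encoding; omit_defaults skips fields left at their default value.
payload = msgspec.msgpack.encode(params)
restored = msgspec.msgpack.decode(payload, type=ToyParams)
print(len(payload), restored)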

Validation logic

SamplingParams.__post_init__() enforces constraints:

# vllm/sampling_params.py
def __post_init__(self) -> None:
    # Normalize stop to a list
    if self.stop is None:
        self.stop = []
    elif isinstance(self.stop, str):
        self.stop = [self.stop]

    # Zero temperature → force greedy sampling
    if self.temperature < _SAMPLING_EPS:
        self.top_p = 1.0
        self.top_k = 0
        self.min_p = 0.0
        self._verify_greedy_sampling()

    self._verify_args()

def _verify_args(self) -> None:
    if self.n < 1:
        raise ValueError(f"n must be at least 1, got {self.n}.")
    if not -2.0 <= self.presence_penalty <= 2.0:
        raise ValueError(...)
    if self.temperature < 0.0:
        raise VLLMValidationError(...)
    if not 0.0 < self.top_p <= 1.0:
        raise VLLMValidationError(...)
    if self.max_tokens is not None and self.max_tokens < 1:
        raise VLLMValidationError(...)

πŸ“ Note:
When temperature=0, vLLM automatically sets top_p=1.0, top_k=0, and min_p=0.0. This is because greedy decoding (always pick the highest-probability token) makes all other sampling parameters irrelevant. The code enforces this rather than letting the user set contradictory values.
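You can check this normalization directly. The expected values below follow from the __post_init__ logic shown above; verify against your installed vLLM version:

from vllm import SamplingParams

p = SamplingParams(temperature=0.0, top_p=0.5, top_k=50, min_p=0.2)

# Expected per the normalization above: temperature=0.0, top_p=1.0, top_k=0, min_p=0.0
print(p.temperature, p.top_p, p.top_k, p.min_p)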

7. RequestOutput and CompletionOutput - What You Get Back

After generate() finishes, you get a list of RequestOutput objects:

# vllm/outputs.py
class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        num_cached_tokens: int | None = None,
        ...
    ): ...

Each RequestOutput contains one or more CompletionOutput objects (one per n in SamplingParams):

# vllm/outputs.py
@dataclass
class CompletionOutput:
    index: int                         # Which of the n outputs
    text: str                          # Generated text
    token_ids: GenericSequence[int]    # Generated token IDs
    cumulative_logprob: float | None   # Sum of log probs
    logprobs: SampleLogprobs | None    # Per-token log probs
    finish_reason: str | None          # "stop", "length", or None
    stop_reason: int | str | None      # What triggered the stop

    def finished(self) -> bool:
        return self.finish_reason is not None

A typical usage pattern:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(
    ["What is the capital of France?", "Explain gravity in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64)
)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    reason = output.outputs[0].finish_reason  # "stop" or "length"
    print(f"Prompt: {prompt}")
    print(f"Output: {generated}")
    print(f"Finished because: {reason}\n")

When n > 1, you get multiple completions per prompt:

outputs = llm.generate(
    ["Tell me a joke."],
    SamplingParams(n=3, temperature=0.9, max_tokens=100)
)

# outputs[0].outputs has 3 CompletionOutput objects
for completion in outputs[0].outputs:
    print(f"Completion {completion.index}: {completion.text}")
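One common follow-up when n > 1 is to keep the most likely completion using cumulative_logprob. Note this is a sketch: depending on the vLLM version, cumulative_logprob may be None unless you also request logprobs in SamplingParams.

# Pick the highest-cumulative-log-probability completion among the n candidates.
candidates = [c for c in outputs[0].outputs if c.cumulative_logprob is not None]
if candidates:
    best = max(candidates, key=lambda c: c.cumulative_logprob)
    print(f"Best completion ({best.index}): {best.text}")
else:
    print("cumulative_logprob not populated; request logprobs in SamplingParams.")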

8. Beyond generate() - Other Task Types

The LLM class supports more than text generation:

# Chat (applies chat template automatically)
outputs = llm.chat(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    sampling_params=SamplingParams(max_tokens=32)
)

# Embeddings
outputs = llm.embed(["Hello world", "Goodbye world"])

# Classification (not all models support this)
outputs = llm.classify(["This movie was great!", "Terrible film."])

# Scoring (cross-encoder style)
outputs = llm.score("query text", ["doc1", "doc2", "doc3"])

Each method validates that the loaded model supports the requested task via runner_type. If you try llm.generate() on an embedding model, you get a clear error.


Exercises

Exercise 1: Basic Generation

Difficulty: Beginner
Goal: Understand the relationship between SamplingParams and output

Given this code:

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0, max_tokens=5, n=2)
outputs = llm.generate(["Count from 1 to 10."], params)
  1. What happens when temperature=0 and n=2? Will the two completions be different or identical? Why?
  2. What will finish_reason be for each completion? ("stop" or "length")
  3. How many CompletionOutput objects will be in outputs[0].outputs?

Hint: Think about what greedy decoding means for multiple samples.

Solution

  1. Identical. Temperature=0 means greedy decoding - always pick the highest-probability token. With no randomness, every sample produces the exact same sequence. Running n=2 with greedy is wasteful.
  2. "length" for both. max_tokens=5 will cut off "Count from 1 to 10" well before the model naturally stops - it would need at least ~20 tokens ("1, 2, 3, 4, 5, 6, 7, 8, 9, 10").
  3. 2 - one per n. outputs[0].outputs[0] and outputs[0].outputs[1], though both will have the same text.

Exercise 2: Trace the Call Path

Difficulty: Intermediate
Goal: Map the execution flow from user call to engine loop

Trace what happens when this code executes:

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Hello", "World"],
    SamplingParams(max_tokens=10)
)

For each step, name the method and describe what it does:

  1. What does generate() call first?
  2. How are the two prompts and the single SamplingParams paired?
  3. What does _run_engine() do on each iteration?
  4. Why are outputs sorted at the end?

Solution

  1. generate() first validates the runner type is "generate", then calls _validate_and_add_requests().
  2. The single SamplingParams is replicated: [params] * num_requests - so both "Hello" and "World" get the same max_tokens=10.
  3. Each iteration calls self.llm_engine.step(), which runs one scheduling + inference cycle. Finished requests are collected into the outputs list.
  4. Because requests finish out of order. "Hello" (shorter) might finish before "World" (or vice versa depending on generation). Sorting by request_id ensures outputs[0] corresponds to "Hello" and outputs[1] to "World".

Exercise 3: SamplingParams Edge Cases

Difficulty: Intermediate
Goal: Understand validation and normalization

What happens in each case? Does it succeed, raise an error, or get silently normalized?

  1. SamplingParams(temperature=-0.5)
  2. SamplingParams(temperature=0, top_k=50)
  3. SamplingParams(top_p=0.0)
  4. SamplingParams(stop="END", include_stop_str_in_output=False) β€” what does output_text_buffer_length get set to?
  5. SamplingParams(max_tokens=0)

Solution

  1. Raises VLLMValidationError - _verify_args() checks self.temperature < 0.0.
  2. Silently normalized. When temperature=0, __post_init__ forces top_k=0 (along with top_p=1.0, min_p=0.0). Your top_k=50 is overwritten.
  3. Raises VLLMValidationError - _verify_args() checks not 0.0 < self.top_p <= 1.0. Zero is not in the valid range.
  4. output_text_buffer_length is set to len("END") - 1 = 2. This buffer ensures the output processor doesn't emit text that might be part of the stop string before the full match is determined.
  5. Raises VLLMValidationError - _verify_args() checks self.max_tokens < 1 when not None.

Exercise 4: Design a Batch Inference Script

Difficulty: Advanced
Goal: Apply what you've learned to a realistic scenario

You have a file with 10,000 prompts (one per line). You need to generate completions with temperature=0.8 and max_tokens=256, saving results to a JSON file. Design the script:

  1. Should you call generate() once with all 10,000 prompts, or in batches of 100? Why?
  2. How would you handle prompts that need different max_tokens?
  3. If 3 out of 10,000 prompts fail, how would you know which ones? (Hint: look at request_id)

Solution

  1. Call generate() once with all 10,000. vLLM's continuous batching handles scheduling internally - it dynamically fits as many requests as GPU memory allows per step. Breaking into batches of 100 would serialize work unnecessarily and prevent vLLM from optimally utilizing the GPU.
  2. Pass a list of SamplingParams, one per prompt: [SamplingParams(max_tokens=t) for t in per_prompt_max_tokens]. This lets each prompt have its own configuration.
  3. Match by index. Since generate() sorts outputs by request_id (which maps to the input order), outputs[i] corresponds to prompts[i]. Check outputs[i].outputs[0].finish_reason - if it's None or shows an unexpected state, that prompt had issues. You could also check len(outputs) vs len(prompts) to see if any were dropped.
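A minimal sketch of such a script under the assumptions above (the file names, JSON fields, and single-call strategy are illustrative, not a prescribed layout):

import json
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# One prompt per line (illustrative input path).
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(prompts, params)   # single call; vLLM batches internally

results = [
    {
        "prompt": prompts[i],
        "completion": out.outputs[0].text,
        "finish_reason": out.outputs[0].finish_reason,
    }
    for i, out in enumerate(outputs)
]

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)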

Quiz

Answer these questions based on today's material. Try to answer each question before revealing the answer.

Q1: What does LLM.__init__() actually do with its parameters? Where does the heavy lifting happen?

Answer

It packs parameters into EngineArgs and calls LLMEngine.from_engine_args(). The LLM class itself does minimal work - it's a convenience wrapper. The engine factory method parses the args into a VllmConfig, selects the right executor, loads the model, allocates the KV cache, and initializes the scheduler.

Q2: Why does _run_engine() sort its outputs by request_id before returning?

Answer

Because requests don't finish in order. Short prompts complete in fewer iterations than long ones. Since _run_engine() collects outputs as they finish, a request with request_id=5 might finish before request_id=3. Sorting by request ID restores the original prompt order so outputs[i] corresponds to prompts[i].

Q3: What is the default value of max_tokens in SamplingParams, and why might this surprise users?

Answer

The default is 16 tokens. This is much smaller than most users expect from hosted chat APIs. If your outputs seem cut short, you probably need to set max_tokens explicitly. The low default is intentional - it prevents accidental resource exhaustion when experimenting.

Q4: What happens internally to SamplingParams when you set temperature=0?

Answer

vLLM forces greedy sampling parameters. Specifically, it sets top_p=1.0, top_k=0, and min_p=0.0, then calls _verify_greedy_sampling(). This is because when temperature is zero (always pick the highest-probability token), top-p/top-k filtering is meaningless and could introduce unexpected behavior.

Q5: Why does SamplingParams inherit from msgspec.Struct instead of using a Python @dataclass?

Answer

Serialization performance. msgspec.Struct provides 10-50x faster serialization than pickle (used with dataclasses). This matters because when vLLM runs in multiprocess mode, SamplingParams is serialized with msgspec.msgpack and sent over ZMQ sockets from the frontend process to the engine core process. Faster serialization = lower per-request overhead.

Q6: What is the difference between finish_reason="stop" and finish_reason="length" in CompletionOutput?

Answer

"stop" means the model naturally stopped β€” it hit an EOS token, a stop string, or a stop token ID. "length" means it hit max_tokens β€” the model wanted to keep generating but was cut off. If you see many "length" finishes, consider increasing max_tokens.

Q7: True or false: Calling llm.generate() with a single SamplingParams and a list of 100 prompts will use the same sampling parameters for all 100 prompts.

Answer

True. When you pass a single SamplingParams (not a list), _validate_and_add_requests() replicates it: engine_params = [params] * num_requests. Each prompt gets the same sampling configuration. To use different parameters per prompt, pass a list of SamplingParams with the same length as the prompts list.

Q8: What does gpu_memory_utilization=0.9 mean, and what happens to the other 10%?

Answer

It means vLLM may use up to 90% of GPU memory in total - model weights, activation workspace, and the KV cache all come out of that budget (the KV cache gets whatever is left after weights and activations are accounted for). The remaining 10% is headroom for allocations outside vLLM's accounting, such as the CUDA context and allocator fragmentation. If you set it too high (e.g., 0.99), you risk CUDA OOM during forward passes. If you set it too low, you waste GPU capacity.


Summary

  • vLLM solves the KV cache memory waste problem via PagedAttention β€” block-based allocation instead of pre-allocation
  • The LLM class is a thin wrapper: it creates EngineArgs, builds LLMEngine, and provides generate(), chat(), embed(), and other task methods
  • generate() validates inputs, adds requests to the engine, then loops step() until all requests finish
  • _run_engine() is a simple while-loop over llm_engine.step() β€” this is where continuous batching happens under the hood
  • SamplingParams controls per-request generation with thorough validation β€” zero temperature forces greedy mode, invalid ranges raise errors
  • RequestOutput wraps one or more CompletionOutput objects, each containing the generated text, token IDs, and finish reason
  • Next session: The Engine Layer β€” what LLMEngine does inside step(), how InputProcessor tokenizes prompts, and how EngineCoreClient bridges to the core

Generated from my ai-study learning project.
