This is part of my vLLM learning series. In this session, I cover Step 2 (The Engine Layer).
Note: This content was generated by Claude, grounded on the actual
vLLM codebase. It is intended for personal
learning only and may contain inaccuracies. Always verify against the
original source code and official documentation.
Topic: vLLM
Date: 2026-02-01
Sections covered: Step 2 (The Engine Layer)
Prerequisites: Session 1 — LLM class, SamplingParams, generate() flow, RequestOutput
Review
In Session 1, we learned that the LLM class is a thin wrapper around LLMEngine. When you call llm.generate(), the flow is:
- _validate_and_add_requests() — pairs prompts with SamplingParams
- _run_engine() — loops self.llm_engine.step() until all requests finish
- Returns sorted RequestOutput objects
We saw that LLM.__init__() calls LLMEngine.from_engine_args() — but we treated the engine as a black box. Today we open that box.
The key question: What happens inside llm_engine.step()? The answer involves three components: InputProcessor, EngineCoreClient, and OutputProcessor.
Today's Material
1. LLMEngine — The Orchestrator
The LLMEngine sits between the user-facing LLM class and the core scheduling/execution machinery. Its job is to:
- Preprocess inputs (tokenize prompts, handle multimodal data)
- Relay preprocessed requests to the engine core
- Postprocess raw outputs (detokenize, format for the user)
# vllm/v1/engine/llm_engine.py
class LLMEngine:
    def __init__(self, vllm_config, executor_class, log_stats, ...):
        self.engine_core = EngineCoreClient(...)      # Talks to core
        self.input_processor = InputProcessor(...)    # Tokenize inputs
        self.output_processor = OutputProcessor(...)  # Format outputs

    def add_request(self, request_id, prompt, params, ...):
        """Preprocess and send request to engine core."""

    def step(self) -> list[RequestOutput]:
        """One iteration: get outputs from core, process, return."""

    @classmethod
    def from_engine_args(cls, engine_args) -> "LLMEngine":
        """Factory: parse args -> VllmConfig -> create engine."""
Think of LLMEngine as a translator: it speaks "user language" (strings, Python objects) on one side and "engine language" (token IDs, msgspec structs) on the other.
📝 Note:
vLLM has undergone a major architectural evolution. The v1/ directory contains the current architecture. Older code in the root vllm/engine/ directory is the legacy (v0) engine. When reading code, focus on vllm/v1/ — that's where active development happens.
2. The Factory: from_engine_args()
Before exploring the runtime flow, let's see how LLMEngine gets created:
# vllm/v1/engine/llm_engine.py
@classmethod
def from_engine_args(cls, engine_args, usage_context, ...) -> "LLMEngine":
    """Factory: parse args -> VllmConfig -> create engine."""
    vllm_config = engine_args.create_engine_config()
    # vllm_config is a VllmConfig that bundles:
    #   ModelConfig, CacheConfig, ParallelConfig,
    #   SchedulerConfig, DeviceConfig, LoadConfig, ...

    executor_class = Executor.get_class(vllm_config)
    # Selects UniProcExecutor, MultiprocExecutor, or RayDistributedExecutor

    return cls(vllm_config=vllm_config,
               executor_class=executor_class,
               usage_context=usage_context, ...)
This is a classic factory pattern. The user provides simple arguments (model="meta-llama/...", tensor_parallel_size=2), and the factory:
- Parses them into a structured VllmConfig
- Selects the right executor class based on configuration
- Constructs the engine with all dependencies wired up
Why this matters: The factory is where all of vLLM's auto-configuration happens. It determines dtype (auto-selects fp16/bf16 based on GPU capability), figures out how many blocks fit in memory, and selects the appropriate attention backend.
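To make the factory concrete, here is a hedged usage sketch. The import paths follow the file headers shown above; the EngineArgs field names are assumptions, and depending on your vLLM version from_engine_args may expect additional arguments (e.g. usage_context):

# Sketch only — constructing an engine via the factory instead of the LLM wrapper.
from vllm import EngineArgs
from vllm.v1.engine.llm_engine import LLMEngine

engine_args = EngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,   # folded into ParallelConfig inside VllmConfig
)
engine = LLMEngine.from_engine_args(engine_args)   # args -> VllmConfig -> engine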
3. InputProcessor — From Strings to Tokens
When add_request() is called, the first thing that happens is input processing:
# vllm/v1/engine/input_processor.py
class InputProcessor:
    def __init__(self, vllm_config, tokenizer, ...):
        self.tokenizer = tokenizer
        self.mm_processor = ...  # Multimodal input processor

    def process(self, request_id, prompt, params, ...) -> EngineCoreRequest:
        """Tokenize prompt, process multimodal inputs,
        create EngineCoreRequest."""
The processor handles several input formats:
# Users can provide prompts in multiple ways:
llm.generate("Hello world") # Plain string
llm.generate({"prompt": "Hello world"}) # Dict with string
llm.generate({"prompt_token_ids": [15496, 995]}) # Pre-tokenized
llm.generate({                                 # Multimodal
    "prompt": "What's in this image?",
    "multi_modal_data": {"image": image_data},
})
No matter what format you use, InputProcessor.process() normalizes it into an EngineCoreRequest — the standard wire format for the engine core.
The tokenization step converts your string prompt into token IDs:
"What is the capital of France?"
→ tokenizer.encode()
→ [1, 1724, 338, 278, 7483, 310, 3444, 29973]
💡 Tip:
If you already have token IDs (e.g., from your own tokenizer or preprocessing pipeline), pass {"prompt_token_ids": [...]} to skip redundant tokenization. This saves CPU time for high-throughput applications.
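For example, a hedged sketch of pre-tokenizing with a Hugging Face tokenizer and handing the IDs straight to vLLM (the model name is just a placeholder):

# Sketch only — tokenize once yourself, then skip vLLM's tokenization step.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tok.encode("What is the capital of France?")

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate([{"prompt_token_ids": ids}], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)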
4. EngineCoreRequest — The Wire Format
The output of InputProcessor is an EngineCoreRequest:
# vllm/v1/engine/__init__.py
class EngineCoreRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec] | None
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    eos_token_id: int | None
    arrival_time: float
    lora_request: LoRARequest | None
    cache_salt: str | None
    data_parallel_rank: int | None
    prompt_embeds: torch.Tensor | None
    client_index: int
    current_wave: int
    priority: int
    trace_headers: Mapping[str, str] | None
    resumable: bool
    external_req_id: str | None
Why is this a separate type from Request (which the scheduler uses internally)?
Separation of concerns:
- EngineCoreRequest is the transport format — designed for serialization across process boundaries
- Request (in the scheduler) is the runtime format — tracks mutable state like num_computed_tokens, allocated blocks, output tokens
This separation is important because vLLM can run in multiprocess mode: the FastAPI server and InputProcessor run in one process, while the EngineCore (scheduler + executor) runs in another. The EngineCoreRequest gets serialized with msgspec.msgpack.encode(), sent over a ZMQ socket, and deserialized on the other side.
Process 1 (Frontend)           Process 2 (Engine Core)
┌──────────────────┐           ┌──────────────────┐
│ InputProcessor   │           │ Scheduler        │
│        ↓         │           │        ↓         │
│ EngineCoreRequest│──ZMQ──→   │ Request          │
│                  │           │ (mutable state)  │
└──────────────────┘           └──────────────────┘
📝 Note:
Why msgspec instead of pickle or JSON? msgspec.msgpack is 10-50x faster than pickle for structured data and produces smaller payloads than JSON. For a system processing thousands of requests per second, serialization overhead directly impacts throughput. This is not premature optimization — it's a measured bottleneck.
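To make the serialization step concrete, here is a tiny standalone msgspec round trip. The struct is a toy, not vLLM's actual EngineCoreRequest — only the encode/decode pattern matches the description above:

# Toy example — not vLLM code. Demonstrates the msgspec.Struct round trip.
import msgspec

class ToyRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int]
    priority: int = 0

payload = msgspec.msgpack.encode(ToyRequest("req-1", [1, 2, 3]))   # bytes on the wire
decoded = msgspec.msgpack.decode(payload, type=ToyRequest)         # typed struct back
assert decoded.prompt_token_ids == [1, 2, 3]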
5. EngineCoreClient — Bridging Processes
EngineCoreClient abstracts the communication between the engine layer and the engine core:
# Conceptual interface:
class EngineCoreClient:
    def add_request(self, request: EngineCoreRequest) -> None:
        """Send request to the core."""

    def get_output(self) -> list[EngineCoreOutput]:
        """Get completed/streaming outputs from the core."""
The client has two modes:
| Mode | When | How it works |
|---|---|---|
| In-process | LLM class (offline) | Direct function calls to EngineCore |
| Multiprocess | API server | ZMQ sockets between processes |
In in-process mode (what you get with the LLM class), EngineCoreClient directly calls methods on an EngineCore object in the same process. No serialization overhead.
In multiprocess mode (the OpenAI-compatible server), EngineCoreClient serializes requests with msgspec.msgpack, sends them over ZMQ, and the EngineCore process deserializes and processes them. This keeps the FastAPI event loop responsive while heavy inference runs in a separate process.
# Simplified view of multiprocess communication:
# Frontend process:
encoded = msgspec.msgpack.encode(engine_core_request)
zmq_socket.send(encoded)
# Engine core process:
data = zmq_socket.recv()
request = msgspec.msgpack.decode(data, type=EngineCoreRequest)
scheduler.add_request(request)
Why this matters: This two-process architecture is critical for production deployments. Without it, long-running model forward passes on the GPU would block the HTTP server from accepting new requests.
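Here is a minimal, self-contained toy of that transport pattern — one process, an in-process ZMQ endpoint, and raw bytes standing in for the msgspec payload. The socket types and endpoint name are illustrative, not vLLM's actual wiring:

# Toy PUSH/PULL demo of the "frontend sends, core receives" pattern.
import zmq

ctx = zmq.Context.instance()
core = ctx.socket(zmq.PULL)        # stands in for the engine core process
core.bind("inproc://engine_core")
frontend = ctx.socket(zmq.PUSH)    # stands in for the API server / frontend
frontend.connect("inproc://engine_core")

frontend.send(b"msgspec-encoded request bytes would go here")
print(core.recv())                 # the core would decode this and call scheduler.add_request()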
6. The step() Method — One Iteration
Now we can understand what happens in each call to llm_engine.step():
# vllm/v1/engine/llm_engine.py (simplified)
def step(self) -> list[RequestOutput]:
    # 1. Get raw outputs from the engine core
    engine_core_outputs = self.engine_core.get_output()

    # 2. Process outputs: detokenize, format, check completion
    request_outputs = self.output_processor.process_outputs(
        engine_core_outputs)

    return request_outputs
Each step() returns a list of RequestOutput objects — some may be streaming (partial), others may be finished. The _run_engine() loop in LLM collects the finished ones.
But what triggers the core to actually run inference? In in-process mode, get_output() internally calls engine_core.step() which runs the scheduler + model execution. In multiprocess mode, the engine core runs its own loop continuously, and get_output() just reads from a queue.
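Putting add_request() and step() together, here is a hedged sketch of driving LLMEngine by hand instead of through LLM.generate(). The method names follow the simplified listings above (including has_unfinished_requests() from the _run_engine() pseudocode); real signatures may differ between versions:

# Sketch only — a manual engine loop under the assumptions stated above.
from vllm import EngineArgs, SamplingParams
from vllm.v1.engine.llm_engine import LLMEngine

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request("req-0", "What is AI?", SamplingParams(max_tokens=20))

while engine.has_unfinished_requests():
    for request_output in engine.step():       # one iteration of the core
        if request_output.finished:
            print(request_output.outputs[0].text)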
7. OutputProcessor — From Tokens to Text
The OutputProcessor is the mirror of InputProcessor:
# vllm/v1/engine/output_processor.py
class OutputProcessor:
    def __init__(self, tokenizer, log_stats, ...):
        self.tokenizer = tokenizer
        self.output_states: dict[str, RequestState] = {}
It receives EngineCoreOutput (raw token IDs from the core) and produces RequestOutput (user-facing results). The key operations:
- Accumulate tokens — Maintains a running state per request
- Detokenize — Converts token IDs back to text using the tokenizer
- Handle streaming modes — CUMULATIVE returns the full text so far; DELTA returns only new tokens
- Track completion — Checks finish_reason to know when a request is done
# EngineCoreOutput — what the core produces:
class EngineCoreOutput(msgspec.Struct):
    request_id: str
    new_token_ids: list[int]            # Newly generated tokens this step
    new_logprobs: LogprobsLists | None
    finish_reason: FinishReason | None  # STOP, LENGTH, ABORT, or None
    stop_reason: int | str | None
    num_cached_tokens: int = 0
The transformation:
EngineCoreOutput                        RequestOutput
┌──────────────────────┐                ┌─────────────────────────┐
│ request_id: "req-42" │                │ request_id: "req-42"    │
│ new_token_ids: [464] │   detokenize   │ prompt: "Hello"         │
│ finish_reason: None  │ ───────────→   │ outputs: [              │
│                      │                │   CompletionOutput(     │
└──────────────────────┘                │     text: " world",     │
                                        │     token_ids: [464],   │
                                        │     finish_reason: None │
                                        │   )                     │
                                        │ ]                       │
                                        │ finished: False         │
                                        └─────────────────────────┘
⚠️ Warning:
Detokenization is not trivially reversible. Many tokenizers use byte-level BPE, where a single token might represent part of a multi-byte UTF-8 character. The OutputProcessor handles these edge cases — if a token produces an incomplete character, it buffers bytes until a valid character is formed. This is why you sometimes see "garbled" output when accessing raw token_ids without proper detokenization.
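You can see the problem with any byte-level BPE tokenizer (GPT-2 here, purely as an illustration): decoding tokens one at a time can land in the middle of a multi-byte character.

# Illustration of why per-token detokenization needs byte buffering.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("café ☕")
print([tok.decode([i]) for i in ids])   # per-token decode: may contain '�' fragments
print(tok.decode(ids))                  # decoding the full sequence: clean text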
8. Putting It All Together — The Full Request Lifecycle
Let's trace a request from start to finish:
User calls: llm.generate(["What is AI?"], SamplingParams(max_tokens=20))

1. LLM.generate()
   └→ _validate_and_add_requests()
      └→ llm_engine.add_request(request_id="0", prompt="What is AI?", params=...)
         └→ InputProcessor.process()
            - Tokenize: "What is AI?" → [1, 1724, 338, 319, 29902, 29973]
            - Create EngineCoreRequest(request_id="0",
                                       prompt_token_ids=[1, 1724, ...],
                                       sampling_params=...,
                                       arrival_time=time.time())
         └→ engine_core.add_request(engine_core_request)

2. LLM._run_engine()
   while has_unfinished_requests():
      └→ llm_engine.step()
         └→ engine_core.get_output()
            - Core runs: schedule → execute model → sample tokens
            - Returns EngineCoreOutput(request_id="0",
                                       new_token_ids=[23435],
                                       finish_reason=None)
         └→ OutputProcessor.process_outputs()
            - Detokenize [23435] → " Artificial"
            - Accumulate: text = " Artificial"
            - Return RequestOutput(finished=False, ...)

      ... more steps, generating tokens one at a time ...

      └→ llm_engine.step()   (final iteration)
         └→ engine_core.get_output()
            - Returns EngineCoreOutput(request_id="0",
                                       new_token_ids=[29889],
                                       finish_reason=FinishReason.LENGTH)
         └→ OutputProcessor.process_outputs()
            - Detokenize [29889] → "."
            - Accumulate: text = " Artificial intelligence is..."
            - finish_reason = "length" (hit max_tokens=20)
            - Return RequestOutput(finished=True, ...)

3. _run_engine() collects finished output, sorts by request_id, returns
📝 Note:
In practice, the engine core doesn't generate just one token per step. With continuous batching, a single step() processes tokens for all active requests simultaneously. If there are 50 active requests, one GPU forward pass generates the next token for all 50. The OutputProcessor then demultiplexes the results back to individual RequestOutput objects.
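A toy sketch of that demultiplexing step, with plain dicts standing in for EngineCoreOutput (all identifiers illustrative):

# Toy demux: route each per-request chunk of a batched step back to its own state.
core_outputs = [
    {"request_id": "req-0", "new_token_ids": [11]},
    {"request_id": "req-1", "new_token_ids": [42, 13]},
]
per_request_tokens: dict[str, list[int]] = {}
for out in core_outputs:
    per_request_tokens.setdefault(out["request_id"], []).extend(out["new_token_ids"])
print(per_request_tokens)   # {'req-0': [11], 'req-1': [42, 13]}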
Exercises
Exercise 1: Component Identification
Difficulty: Beginner
Goal: Verify you can identify the role of each engine layer component
For each of the following operations, name which component handles it:
- Converting the string "Hello world" into token IDs [15496, 995]
- Deciding which requests get GPU time this iteration
- Converting EngineArgs into a VllmConfig
- Decoding token ID [29889] back into the string "."
- Sending an EngineCoreRequest from the frontend process to the engine core process
Exercise 2: Multiprocess vs. In-Process
Difficulty: Intermediate
Goal: Understand when and why vLLM uses multiprocess communication
Consider two scenarios:
Scenario A: Offline batch processing
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, params)
Scenario B: Production API server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
For each scenario:
- Is the EngineCoreClient in in-process or multiprocess mode?
- Does the EngineCoreRequest actually get serialized with msgspec?
- What would happen if the API server ran the engine core in-process (same event loop)?
Solution
Scenario A (offline LLM class):
- In-process — direct function calls to EngineCore.
- No — the EngineCoreRequest is created but passed directly without serialization.
- N/A — there's no HTTP server.
Scenario B (API server):
- Multiprocess — ZMQ sockets between frontend and engine core processes.
- Yes — the request is serialized with msgspec.msgpack.encode(), sent over ZMQ, and deserialized by the core.
- The HTTP server would block during GPU forward passes. A single inference step can take 10-100 ms, during which the server couldn't accept new connections or respond to health checks. Under load, this would cause request timeouts and dropped connections.
Exercise 3: Tracing Data Transformations
Difficulty: Intermediate
Goal: Follow the data as it changes form through the pipeline
Starting with this call:
llm.generate(
[{"prompt_token_ids": [1, 2, 3, 4, 5]}],
SamplingParams(max_tokens=3, temperature=0)
)
- Does InputProcessor tokenize this prompt? Why or why not?
- What fields of EngineCoreRequest are set? What's arrival_time used for?
- If the model generates tokens [100, 200, 300], what does the EngineCoreOutput for the final step look like?
- What is finish_reason and why?
Solution
- No. The input is {"prompt_token_ids": [1, 2, 3, 4, 5]} — already tokenized. InputProcessor detects the prompt_token_ids key and skips tokenization, using the provided IDs directly.
- Key fields: request_id (auto-assigned), prompt_token_ids=[1, 2, 3, 4, 5], sampling_params (with max_tokens=3, temperature=0), arrival_time=time.time(). arrival_time is used by the scheduler for FCFS ordering and for latency metrics.
- The final EngineCoreOutput would be: EngineCoreOutput(request_id="0", new_token_ids=[300], finish_reason=FinishReason.LENGTH, stop_reason=None). Each step produces one new token, and the third token triggers the length limit.
- FinishReason.LENGTH — the model generated exactly max_tokens=3 tokens ([100, 200, 300]) and was stopped. It didn't hit an EOS or stop token naturally.
Exercise 4: Design Challenge — Adding Request Priority
Difficulty: Advanced
Goal: Think through how a feature propagates through the engine layer
Suppose you want to add priority-based scheduling: high-priority requests should be processed before low-priority ones. Trace through the architecture:
- Where does the user specify priority? (Hint: look at LLM.generate() parameters)
- How does priority get from the user to the scheduler? List each class it passes through.
- Why is priority a field on EngineCoreRequest rather than just on SamplingParams?
- What would happen if the OutputProcessor also needed to know about priority? Would the current architecture support that?
Hint: Priority is already partially implemented — look at the EngineCoreRequest fields.
Solution
- Via the priority parameter in LLM.generate(prompts, params, priority=[1, 2, ...]) — see the sketch after this list.
- The path is: LLM.generate() → _validate_and_add_requests() → LLMEngine.add_request() → InputProcessor.process() (sets the priority field on EngineCoreRequest) → EngineCoreClient.add_request() → Scheduler (reads priority from the request).
- Priority is a request-level concept, not a generation-level concept. SamplingParams controls how tokens are sampled (temperature, top-p, etc.) — it's about the quality of the output. Priority controls when the request gets scheduled — it's about resource allocation. Mixing them would conflate two different concerns.
- Yes — the OutputProcessor receives EngineCoreOutput, which includes the request_id. It could look up priority from its internal state (it already maintains per-request RequestState). But currently it doesn't need to — priority only matters for scheduling decisions.
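A hedged sketch of the user-facing end of that path. It assumes a vLLM version where LLM.generate() accepts a priority argument and the scheduler is configured for priority scheduling; the "lower value wins" convention is an assumption — verify against the scheduler docs:

# Sketch only — priority-aware generation; parameter semantics are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # illustrative model
params = SamplingParams(max_tokens=16)

outputs = llm.generate(
    ["background batch-job prompt", "interactive user prompt"],
    params,
    priority=[10, 0],   # assumption: lower value = scheduled earlier
)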
Exercise 5: Streaming Output Modes
Difficulty: Advanced
Goal: Understand the difference between CUMULATIVE and DELTA output modes
Given a request that generates the text "Hello world!" as three tokens:
- Step 1: token "Hello"
- Step 2: token " world"
- Step 3: token "!"
Write out what RequestOutput.outputs[0].text contains at each step for:
- output_kind = RequestOutputKind.CUMULATIVE
- output_kind = RequestOutputKind.DELTA
When would you use each mode? Think about a streaming chat UI vs. a batch processing pipeline.
Solution
CUMULATIVE:
- Step 1: "Hello"
- Step 2: "Hello world"
- Step 3: "Hello world!"
DELTA:
- Step 1: "Hello"
- Step 2: " world"
- Step 3: "!"
When to use each: DELTA is ideal for streaming chat UIs — you append each delta directly to the display. CUMULATIVE is simpler for batch pipelines — you always have the full text so far, no need to track previous outputs. CUMULATIVE is the default because it's easier to use correctly.
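The client-side difference is small — with DELTA you concatenate chunks yourself, while CUMULATIVE already carries the full text each step. A toy sketch, with plain strings standing in for RequestOutput.outputs[0].text:

# Toy sketch of consuming the two streaming modes.
delta_chunks = ["Hello", " world", "!"]                       # what DELTA delivers per step
cumulative_chunks = ["Hello", "Hello world", "Hello world!"]  # what CUMULATIVE delivers

streamed = ""
for chunk in delta_chunks:
    streamed += chunk          # chat UI: append each delta to the display
assert streamed == cumulative_chunks[-1] == "Hello world!"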
Quiz
Answer these questions based on today's material. Try to answer each question before revealing the answer.
Q1: What are the three main components inside LLMEngine, and what does each one do?
Answer
InputProcessor, EngineCoreClient, and OutputProcessor. InputProcessor tokenizes prompts and creates EngineCoreRequest objects. EngineCoreClient sends requests to and receives outputs from the engine core (either in-process or via ZMQ). OutputProcessor detokenizes raw token IDs back into text and formats RequestOutput objects for the user.
Q2: Why does vLLM have both EngineCoreRequest and Request as separate types?
Answer
They serve different purposes across a process boundary. EngineCoreRequest is the transport/wire format — immutable, serializable with msgspec, designed to cross process boundaries efficiently. Request is the scheduler's internal runtime format — mutable, tracks state like num_computed_tokens, allocated KV cache blocks, and output tokens. Mixing these concerns would either make serialization expensive or make runtime tracking awkward.
Q3: What serialization format does vLLM use for inter-process communication, and why was it chosen over alternatives like pickle or JSON?
Answer
msgspec.msgpack — a binary MessagePack format. It's 10-50x faster than pickle for structured data and produces compact binary payloads. JSON was rejected because it's text-based (larger payloads, slower parsing). Pickle was rejected because it's slow for structured data and has security concerns. At thousands of requests per second, serialization overhead is a real bottleneck.
Q4: In multiprocess mode, what happens if the engine core is busy running a forward pass when a new HTTP request arrives?
Answer
The new request is accepted by the frontend process and queued. Because the frontend (FastAPI + InputProcessor) runs in a separate process from the engine core, it can accept and preprocess new HTTP requests while the GPU is busy. The EngineCoreRequest is sent over ZMQ and queued for the next scheduling iteration. This is exactly why the two-process architecture exists.
Q5: What does OutputProcessor do when it receives a token that represents an incomplete UTF-8 character?
Answer
It buffers the incomplete bytes until a valid character is formed. Many tokenizers use byte-level BPE, where tokens can split in the middle of multi-byte UTF-8 characters (e.g., emoji, CJK characters). The OutputProcessor accumulates bytes and only emits text when complete characters are available. This prevents garbled output in streaming responses.
Q6: True or false: In in-process mode (using the LLM class), EngineCoreRequest is still created even though it doesn't need to be serialized.
Answer
True. The InputProcessor always creates an EngineCoreRequest regardless of execution mode. In in-process mode, the request is passed directly to the engine core without serialization. The EngineCoreRequest type serves as a clean interface contract between the engine layer and the core, even when no process boundary exists.
Q7: What is the purpose of arrival_time in EngineCoreRequest?
Answer
It records when the request was submitted, enabling scheduling policies like FCFS (first-come-first-served). The scheduler can use arrival_time to prioritize older requests over newer ones. It's also used for metrics: you can measure end-to-end latency by comparing arrival_time with the completion time. Without it, the scheduler would have no notion of fairness or request ordering.
Q8: Why does LLMEngine.from_engine_args() exist as a classmethod factory instead of putting all the logic in __init__?
Answer
To separate argument parsing from construction. The factory method converts user-friendly EngineArgs (flat key-value pairs) into a structured VllmConfig (nested, validated configuration), selects the right executor class, and then calls __init__. This keeps __init__ simple — it receives fully validated, structured objects. It also allows alternative construction paths (e.g., creating LLMEngine directly with a VllmConfig for testing).
Summary
- LLMEngine is the orchestrator that connects the user-facing API to the engine core, with three sub-components: InputProcessor, EngineCoreClient, and OutputProcessor
- InputProcessor normalizes various input formats (strings, token IDs, multimodal data) into EngineCoreRequest — the standard wire format
- EngineCoreRequest uses msgspec.Struct for fast serialization, enabling efficient multiprocess communication via ZMQ
- EngineCoreClient abstracts the communication mode: in-process for offline use, multiprocess (ZMQ) for production servers
- OutputProcessor reverses the input pipeline: accumulates tokens, detokenizes, handles streaming modes (CUMULATIVE vs DELTA), and produces RequestOutput
- The two-process architecture (frontend + engine core) is critical for production: it keeps the HTTP server responsive while the GPU runs inference
- Next session: The Scheduler — how vLLM decides which requests get GPU time, the token budget system, and chunked prefill
Generated from my ai-study learning project.