This is part of my vLLM learning series. In this session, I cover Step 2 (The Engine Layer).
Note: This content was generated by Claude, grounded on the actual
vLLM codebase. It is intended for personal
learning only and may contain inaccuracies. Always verify against the
original source code and official documentation.
Topic: vLLM
Date: 2026-02-01
Sections covered: Step 2 (The Engine Layer)
Prerequisites: Session 1 — LLM class, SamplingParams, generate() flow, RequestOutput
Review
In Session 1, we learned that the LLM class is a thin wrapper around LLMEngine. When you call llm.generate(), the flow is:
- _validate_and_add_requests() — pairs prompts with SamplingParams
- _run_engine() — loops self.llm_engine.step() until all requests finish
- Returns sorted RequestOutput objects
We saw that LLM.__init__() calls LLMEngine.from_engine_args() — but we treated the engine as a black box. Today we open that box.
The key question: What happens inside llm_engine.step()? The answer involves three components: InputProcessor, EngineCoreClient, and OutputProcessor.
Today's Material
1. LLMEngine — The Orchestrator
The LLMEngine sits between the user-facing LLM class and the core scheduling/execution machinery. Its job is to:
- Preprocess inputs (tokenize prompts, handle multimodal data)
- Relay preprocessed requests to the engine core
- Postprocess raw outputs (detokenize, format for the user)
# vllm/v1/engine/llm_engine.py
class LLMEngine:
    def __init__(self, vllm_config, executor_class, log_stats, ...):
        self.engine_core = EngineCoreClient(...)      # Talks to core
        self.input_processor = InputProcessor(...)    # Tokenize inputs
        self.output_processor = OutputProcessor(...)  # Format outputs

    def add_request(self, request_id, prompt, params, ...):
        """Preprocess and send request to engine core."""

    def step(self) -> list[RequestOutput]:
        """One iteration: get outputs from core, process, return."""

    @classmethod
    def from_engine_args(cls, engine_args) -> "LLMEngine":
        """Factory: parse args -> VllmConfig -> create engine."""
Think of LLMEngine as a translator: it speaks "user language" (strings, Python objects) on one side and "engine language" (token IDs, msgspec structs) on the other.
📝 Note:
vLLM has undergone a major architectural evolution. The v1/ directory contains the current architecture. Older code in the root vllm/engine/ directory is the legacy (v0) engine. When reading code, focus on vllm/v1/ — that's where active development happens.
2. The Factory: from_engine_args()
Before exploring the runtime flow, let's see how LLMEngine gets created:
# vllm/v1/engine/llm_engine.py
@classmethod
def from_engine_args(cls, engine_args, usage_context, ...) -> "LLMEngine":
    """Factory: parse args -> VllmConfig -> create engine."""
    vllm_config = engine_args.create_engine_config()
    # vllm_config is a VllmConfig that bundles:
    #   ModelConfig, CacheConfig, ParallelConfig,
    #   SchedulerConfig, DeviceConfig, LoadConfig, ...

    executor_class = Executor.get_class(vllm_config)
    # Selects UniProcExecutor, MultiprocExecutor, or RayDistributedExecutor

    return cls(vllm_config=vllm_config,
               executor_class=executor_class,
               usage_context=usage_context, ...)
This is a classic factory pattern. The user provides simple arguments (model="meta-llama/...", tensor_parallel_size=2), and the factory:
- Parses them into a structured VllmConfig
- Selects the right executor class based on configuration
- Constructs the engine with all dependencies wired up
Why this matters: The factory is where all of vLLM's auto-configuration happens. It determines dtype (auto-selects fp16/bf16 based on GPU capability), figures out how many blocks fit in memory, and selects the appropriate attention backend.
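To make the factory concrete, here is a hedged usage sketch. The import paths follow the file headers shown above; the EngineArgs field names are assumptions, and depending on your vLLM version from_engine_args may expect additional arguments (e.g. usage_context):

# Sketch only — constructing an engine via the factory instead of the LLM wrapper.
from vllm import EngineArgs
from vllm.v1.engine.llm_engine import LLMEngine

engine_args = EngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,   # folded into ParallelConfig inside VllmConfig
)
engine = LLMEngine.from_engine_args(engine_args)   # args -> VllmConfig -> engine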
3. InputProcessor — From Strings to Tokens
When add_request() is called, the first thing that happens is input processing:
# vllm/v1/engine/input_processor.py
class InputProcessor:
    def __init__(self, vllm_config, tokenizer, ...):
        self.tokenizer = tokenizer
        self.mm_processor = ...  # Multimodal input processor

    def process(self, request_id, prompt, params, ...) -> EngineCoreRequest:
        """Tokenize prompt, process multimodal inputs,
        create EngineCoreRequest."""
The processor handles several input formats:
# Users can provide prompts in multiple ways:
llm.generate("Hello world") # Plain string
llm.generate({"prompt": "Hello world"}) # Dict with string
llm.generate({"prompt_token_ids": [15496, 995]}) # Pre-tokenized
llm.generate({                                 # Multimodal
    "prompt": "What's in this image?",
    "multi_modal_data": {"image": image_data},
})
No matter what format you use, InputProcessor.process() normalizes it into an EngineCoreRequest — the standard wire format for the engine core.
The tokenization step converts your string prompt into token IDs:
"What is the capital of France?"
→ tokenizer.encode()
→ [1, 1724, 338, 278, 7483, 310, 3444, 29973]
💡 Tip:
If you already have token IDs (e.g., from your own tokenizer or preprocessing pipeline), pass {"prompt_token_ids": [...]} to skip redundant tokenization. This saves CPU time for high-throughput applications.
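For example, a hedged sketch of pre-tokenizing with a Hugging Face tokenizer and handing the IDs straight to vLLM (the model name is just a placeholder):

# Sketch only — tokenize once yourself, then skip vLLM's tokenization step.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tok.encode("What is the capital of France?")

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate([{"prompt_token_ids": ids}], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)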
4. EngineCoreRequest — The Wire Format
The output of InputProcessor is an EngineCoreRequest:
# vllm/v1/engine/__init__.py
class EngineCoreRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec] | None
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    eos_token_id: int | None
    arrival_time: float
    lora_request: LoRARequest | None
    cache_salt: str | None
    data_parallel_rank: int | None
    prompt_embeds: torch.Tensor | None
    client_index: int
    current_wave: int
    priority: int
    trace_headers: Mapping[str, str] | None
    resumable: bool
    external_req_id: str | None
Why is this a separate type from Request (which the scheduler uses internally)?
Separation of concerns:
- EngineCoreRequest is the transport format — designed for serialization across process boundaries
- Request (in the scheduler) is the runtime format — tracks mutable state like num_computed_tokens, allocated blocks, output tokens
This separation is important because vLLM can run in multiprocess mode: the FastAPI server and InputProcessor run in one process, while the EngineCore (scheduler + executor) runs in another. The EngineCoreRequest gets serialized with msgspec.msgpack.encode(), sent over a ZMQ socket, and deserialized on the other side.
Process 1 (Frontend)           Process 2 (Engine Core)
┌──────────────────┐           ┌──────────────────┐
│ InputProcessor   │           │ Scheduler        │
│        ↓         │           │        ↓         │
│ EngineCoreRequest│──ZMQ──→   │ Request          │
│                  │           │ (mutable state)  │
└──────────────────┘           └──────────────────┘
📝 Note:
Why msgspec instead of pickle or JSON? msgspec.msgpack is 10-50x faster than pickle for structured data and produces smaller payloads than JSON. For a system processing thousands of requests per second, serialization overhead directly impacts throughput. This is not premature optimization — it's a measured bottleneck.
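To make the serialization step concrete, here is a tiny standalone msgspec round trip. The struct is a toy, not vLLM's actual EngineCoreRequest — only the encode/decode pattern matches the description above:

# Toy example — not vLLM code. Demonstrates the msgspec.Struct round trip.
import msgspec

class ToyRequest(msgspec.Struct):
    request_id: str
    prompt_token_ids: list[int]
    priority: int = 0

payload = msgspec.msgpack.encode(ToyRequest("req-1", [1, 2, 3]))   # bytes on the wire
decoded = msgspec.msgpack.decode(payload, type=ToyRequest)         # typed struct back
assert decoded.prompt_token_ids == [1, 2, 3]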
5. EngineCoreClient — Bridging Processes
EngineCoreClient abstracts the communication between the engine layer and the engine core:
# Conceptual interface:
class EngineCoreClient:
    def add_request(self, request: EngineCoreRequest) -> None:
        """Send request to the core."""

    def get_output(self) -> list[EngineCoreOutput]:
        """Get completed/streaming outputs from the core."""
The client has two modes:
| Mode | When | How it works |
|---|---|---|
| In-process | LLM class (offline) | Direct function calls to EngineCore |
| Multiprocess | API server | ZMQ sockets between processes |
In in-process mode (what you get with the LLM class), EngineCoreClient directly calls methods on an EngineCore object in the same process. No serialization overhead.
In multiprocess mode (the OpenAI-compatible server), EngineCoreClient serializes requests with msgspec.msgpack, sends them over ZMQ, and the EngineCore process deserializes and processes them. This keeps the FastAPI event loop responsive while heavy inference runs in a separate process.
# Simplified view of multiprocess communication:
# Frontend process:
encoded = msgspec.msgpack.encode(engine_core_request)
zmq_socket.send(encoded)
# Engine core process:
data = zmq_socket.recv()
request = msgspec.msgpack.decode(data, type=EngineCoreRequest)
scheduler.add_request(request)
Why this matters: This two-process architecture is critical for production deployments. Without it, long-running model forward passes on the GPU would block the HTTP server from accepting new requests.
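Here is a minimal, self-contained toy of that transport pattern — one process, an in-process ZMQ endpoint, and raw bytes standing in for the msgspec payload. The socket types and endpoint name are illustrative, not vLLM's actual wiring:

# Toy PUSH/PULL demo of the "frontend sends, core receives" pattern.
import zmq

ctx = zmq.Context.instance()
core = ctx.socket(zmq.PULL)        # stands in for the engine core process
core.bind("inproc://engine_core")
frontend = ctx.socket(zmq.PUSH)    # stands in for the API server / frontend
frontend.connect("inproc://engine_core")

frontend.send(b"msgspec-encoded request bytes would go here")
print(core.recv())                 # the core would decode this and call scheduler.add_request()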
6. The step() Method — One Iteration
Now we can understand what happens in each call to llm_engine.step():
# vllm/v1/engine/llm_engine.py (simplified)
def step(self) -> list[RequestOutput]:
    # 1. Get raw outputs from the engine core
    engine_core_outputs = self.engine_core.get_output()

    # 2. Process outputs: detokenize, format, check completion
    request_outputs = self.output_processor.process_outputs(
        engine_core_outputs)

    return request_outputs
Each step() returns a list of RequestOutput objects — some may be streaming (partial), others may be finished. The _run_engine() loop in LLM collects the finished ones.
But what triggers the core to actually run inference? In in-process mode, get_output() internally calls engine_core.step() which runs the scheduler + model execution. In multiprocess mode, the engine core runs its own loop continuously, and get_output() just reads from a queue.
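Putting add_request() and step() together, here is a hedged sketch of driving LLMEngine by hand instead of through LLM.generate(). The method names follow the simplified listings above (including has_unfinished_requests() from the _run_engine() pseudocode); real signatures may differ between versions:

# Sketch only — a manual engine loop under the assumptions stated above.
from vllm import EngineArgs, SamplingParams
from vllm.v1.engine.llm_engine import LLMEngine

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request("req-0", "What is AI?", SamplingParams(max_tokens=20))

while engine.has_unfinished_requests():
    for request_output in engine.step():       # one iteration of the core
        if request_output.finished:
            print(request_output.outputs[0].text)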
7. OutputProcessor — From Tokens to Text
The OutputProcessor is the mirror of InputProcessor:
# vllm/v1/engine/output_processor.py
class OutputProcessor:
    def __init__(self, tokenizer, log_stats, ...):
        self.tokenizer = tokenizer
        self.output_states: dict[str, RequestState] = {}
It receives EngineCoreOutput (raw token IDs from the core) and produces RequestOutput (user-facing results). The key operations:
- Accumulate tokens — Maintains a running state per request
- Detokenize — Converts token IDs back to text using the tokenizer
- Handle streaming modes — CUMULATIVE returns the full text so far; DELTA returns only new tokens
- Track completion — Checks finish_reason to know when a request is done
# EngineCoreOutput — what the core produces:
class EngineCoreOutput(msgspec.Struct):
    request_id: str
    new_token_ids: list[int]            # Newly generated tokens this step
    new_logprobs: LogprobsLists | None
    finish_reason: FinishReason | None  # STOP, LENGTH, ABORT, or None
    stop_reason: int | str | None
    num_cached_tokens: int = 0
The transformation:
EngineCoreOutput                        RequestOutput
┌──────────────────────┐                ┌─────────────────────────┐
│ request_id: "req-42" │                │ request_id: "req-42"    │
│ new_token_ids: [464] │   detokenize   │ prompt: "Hello"         │
│ finish_reason: None  │ ───────────→   │ outputs: [              │
│                      │                │   CompletionOutput(     │
└──────────────────────┘                │     text: " world",     │
                                        │     token_ids: [464],   │
                                        │     finish_reason: None │
                                        │   )                     │
                                        │ ]                       │
                                        │ finished: False         │
                                        └─────────────────────────┘
⚠️ Warning:
Detokenization is not trivially reversible. Many tokenizers use byte-level BPE, where a single token might represent part of a multi-byte UTF-8 character. The OutputProcessor handles these edge cases — if a token produces an incomplete character, it buffers bytes until a valid character is formed. This is why you sometimes see "garbled" output when accessing raw token_ids without proper detokenization.
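You can see the problem with any byte-level BPE tokenizer (GPT-2 here, purely as an illustration): decoding tokens one at a time can land in the middle of a multi-byte character.

# Illustration of why per-token detokenization needs byte buffering.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("café ☕")
print([tok.decode([i]) for i in ids])   # per-token decode: may contain '�' fragments
print(tok.decode(ids))                  # decoding the full sequence: clean text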
8. Putting It All Together — The Full Request Lifecycle
Let's trace a request from start to finish:
User calls: llm.generate(["What is AI?"], SamplingParams(max_tokens=20))

1. LLM.generate()
   └→ _validate_and_add_requests()
      └→ llm_engine.add_request(request_id="0", prompt="What is AI?", params=...)
         └→ InputProcessor.process()
            - Tokenize: "What is AI?" → [1, 1724, 338, 319, 29902, 29973]
            - Create EngineCoreRequest(request_id="0",
                                       prompt_token_ids=[1, 1724, ...],
                                       sampling_params=...,
                                       arrival_time=time.time())
         └→ engine_core.add_request(engine_core_request)

2. LLM._run_engine()
   while has_unfinished_requests():
      └→ llm_engine.step()
         └→ engine_core.get_output()
            - Core runs: schedule → execute model → sample tokens
            - Returns EngineCoreOutput(request_id="0",
                                       new_token_ids=[23435],
                                       finish_reason=None)
         └→ OutputProcessor.process_outputs()
            - Detokenize [23435] → " Artificial"
            - Accumulate: text = " Artificial"
            - Return RequestOutput(finished=False, ...)

      ... more steps, generating tokens one at a time ...

      └→ llm_engine.step()   (final iteration)
         └→ engine_core.get_output()
            - Returns EngineCoreOutput(request_id="0",
                                       new_token_ids=[29889],
                                       finish_reason=FinishReason.LENGTH)
         └→ OutputProcessor.process_outputs()
            - Detokenize [29889] → "."
            - Accumulate: text = " Artificial intelligence is..."
            - finish_reason = "length" (hit max_tokens=20)
            - Return RequestOutput(finished=True, ...)

3. _run_engine() collects finished output, sorts by request_id, returns
📝 Note:
In practice, the engine core doesn't generate just one token per step. With continuous batching, a single step() processes tokens for all active requests simultaneously. If there are 50 active requests, one GPU forward pass generates the next token for all 50. The OutputProcessor then demultiplexes the results back to individual RequestOutput objects.
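A toy sketch of that demultiplexing step, with plain dicts standing in for EngineCoreOutput (all identifiers illustrative):

# Toy demux: route each per-request chunk of a batched step back to its own state.
core_outputs = [
    {"request_id": "req-0", "new_token_ids": [11]},
    {"request_id": "req-1", "new_token_ids": [42, 13]},
]
per_request_tokens: dict[str, list[int]] = {}
for out in core_outputs:
    per_request_tokens.setdefault(out["request_id"], []).extend(out["new_token_ids"])
print(per_request_tokens)   # {'req-0': [11], 'req-1': [42, 13]}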
Exercises
Exercise 1: Component Identification
Difficulty: Beginner
Goal: Verify you can identify the role of each engine layer component
For each of the following operations, name which component handles it:
- Converting the string "Hello world" into token IDs [15496, 995]
- Deciding which requests get GPU time this iteration
- Converting EngineArgs into a VllmConfig
- Decoding token ID [29889] back into the string "."
- Sending an EngineCoreRequest from the frontend process to the engine core process
Exercise 2: Multiprocess vs. In-Process
Difficulty: Intermediate
Goal: Understand when and why vLLM uses multiprocess communication
Consider two scenarios:
Scenario A: Offline batch processing
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, params)
Scenario B: Production API server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
For each scenario:
- Is the EngineCoreClient in in-process or multiprocess mode?
- Does the EngineCoreRequest actually get serialized with msgspec?
- What would happen if the API server ran the engine core in-process (same event loop)?
Solution
Scenario A (offline LLM class):
- In-process — direct function calls to EngineCore.
- No — the EngineCoreRequest is created but passed directly without serialization.
- N/A — there's no HTTP server.
Scenario B (API server):
- Multiprocess — ZMQ sockets between frontend and engine core processes.
- Yes — the request is serialized with msgspec.msgpack.encode(), sent over ZMQ, and deserialized by the core.
- The HTTP server would block during GPU forward passes. A single inference step can take 10-100 ms, during which the server couldn't accept new connections or respond to health checks. Under load, this would cause request timeouts and dropped connections.
Exercise 3: Tracing Data Transformations
Difficulty: Intermediate
Goal: Follow the data as it changes form through the pipeline
Starting with this call:
llm.generate(
[{"prompt_token_ids": [1, 2, 3, 4, 5]}],
SamplingParams(max_tokens=3, temperature=0)
)
- Does InputProcessor tokenize this prompt? Why or why not?
- What fields of EngineCoreRequest are set? What's arrival_time used for?
- If the model generates tokens [100, 200, 300], what does the EngineCoreOutput for the final step look like?
- What is finish_reason and why?
Solution
- No. The input is {"prompt_token_ids": [1, 2, 3, 4, 5]} — already tokenized. InputProcessor detects the prompt_token_ids key and skips tokenization, using the provided IDs directly.
- Key fields: request_id (auto-assigned), prompt_token_ids=[1, 2, 3, 4, 5], sampling_params (with max_tokens=3, temperature=0), arrival_time=time.time(). arrival_time is used by the scheduler for FCFS ordering and for latency metrics.
- The final EngineCoreOutput would be: EngineCoreOutput(request_id="0", new_token_ids=[300], finish_reason=FinishReason.LENGTH, stop_reason=None). Each step produces one new token, and the third token triggers the length limit.
- FinishReason.LENGTH — the model generated exactly max_tokens=3 tokens ([100, 200, 300]) and was stopped. It didn't hit an EOS or stop token naturally.
Exercise 4: Design Challenge — Adding Request Priority
Difficulty: Advanced
Goal: Think through how a feature propagates through the engine layer
Suppose you want to add priority-based scheduling: high-priority requests should be processed before low-priority ones. Trace through the architecture:
- Where does the user specify priority? (Hint: look at LLM.generate() parameters)
- How does priority get from the user to the scheduler? List each class it passes through.
- Why is priority a field on EngineCoreRequest rather than just on SamplingParams?
- What would happen if the OutputProcessor also needed to know about priority? Would the current architecture support that?
Hint: Priority is already partially implemented — look at the EngineCoreRequest fields.
Solution
- Via the priority parameter in LLM.generate(prompts, params, priority=[1, 2, ...]) — see the sketch after this list.
- The path is: LLM.generate() → _validate_and_add_requests() → LLMEngine.add_request() → InputProcessor.process() (sets the priority field on EngineCoreRequest) → EngineCoreClient.add_request() → Scheduler (reads priority from the request).
- Priority is a request-level concept, not a generation-level concept. SamplingParams controls how tokens are sampled (temperature, top-p, etc.) — it's about the quality of the output. Priority controls when the request gets scheduled — it's about resource allocation. Mixing them would conflate two different concerns.
- Yes — the OutputProcessor receives EngineCoreOutput, which includes the request_id. It could look up priority from its internal state (it already maintains per-request RequestState). But currently it doesn't need to — priority only matters for scheduling decisions.
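A hedged sketch of the user-facing end of that path. It assumes a vLLM version where LLM.generate() accepts a priority argument and the scheduler is configured for priority scheduling; the "lower value wins" convention is an assumption — verify against the scheduler docs:

# Sketch only — priority-aware generation; parameter semantics are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # illustrative model
params = SamplingParams(max_tokens=16)

outputs = llm.generate(
    ["background batch-job prompt", "interactive user prompt"],
    params,
    priority=[10, 0],   # assumption: lower value = scheduled earlier
)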
Exercise 5: Streaming Output Modes
Difficulty: Advanced
Goal: Understand the difference between CUMULATIVE and DELTA output modes
Given a request that generates the text "Hello world!" as three tokens:
- Step 1: token "Hello"
- Step 2: token " world"
- Step 3: token "!"
Write out what RequestOutput.outputs[0].text contains at each step for:
- output_kind = RequestOutputKind.CUMULATIVE
- output_kind = RequestOutputKind.DELTA
When would you use each mode? Think about a streaming chat UI vs. a batch processing pipeline.
Solution
CUMULATIVE:
- Step 1: "Hello"
- Step 2: "Hello world"
- Step 3: "Hello world!"
DELTA:
- Step 1: "Hello"
- Step 2: " world"
- Step 3: "!"
When to use each: DELTA is ideal for streaming chat UIs — you append each delta directly to the display. CUMULATIVE is simpler for batch pipelines — you always have the full text so far, no need to track previous outputs. CUMULATIVE is the default because it's easier to use correctly.
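The client-side difference is small — with DELTA you concatenate chunks yourself, while CUMULATIVE already carries the full text each step. A toy sketch, with plain strings standing in for RequestOutput.outputs[0].text:

# Toy sketch of consuming the two streaming modes.
delta_chunks = ["Hello", " world", "!"]                       # what DELTA delivers per step
cumulative_chunks = ["Hello", "Hello world", "Hello world!"]  # what CUMULATIVE delivers

streamed = ""
for chunk in delta_chunks:
    streamed += chunk          # chat UI: append each delta to the display
assert streamed == cumulative_chunks[-1] == "Hello world!"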
Quiz
Answer these questions based on today's material. Try to answer each question before revealing the answer.
Q1: What are the three main components inside LLMEngine, and what does each one do?
Answer
InputProcessor, EngineCoreClient, and OutputProcessor. InputProcessor tokenizes prompts and creates EngineCoreRequest objects. EngineCoreClient sends requests to and receives outputs from the engine core (either in-process or via ZMQ). OutputProcessor detokenizes raw token IDs back into text and formats RequestOutput objects for the user.
Q2: Why does vLLM have both EngineCoreRequest and Request as separate types?
Answer
They serve different purposes across a process boundary. EngineCoreRequest is the transport/wire format — immutable, serializable with msgspec, designed to cross process boundaries efficiently. Request is the scheduler's internal runtime format — mutable, tracks state like num_computed_tokens, allocated KV cache blocks, and output tokens. Mixing these concerns would either make serialization expensive or make runtime tracking awkward.
Q3: What serialization format does vLLM use for inter-process communication, and why was it chosen over alternatives like pickle or JSON?
Answer
msgspec.msgpack — a binary MessagePack format. It's 10-50x faster than pickle for structured data and produces compact binary payloads. JSON was rejected because it's text-based (larger payloads, slower parsing). Pickle was rejected because it's slow for structured data and has security concerns. At thousands of requests per second, serialization overhead is a real bottleneck.
Q4: In multiprocess mode, what happens if the engine core is busy running a forward pass when a new HTTP request arrives?
Answer
The new request is accepted by the frontend process and queued. Because the frontend (FastAPI + InputProcessor) runs in a separate process from the engine core, it can accept and preprocess new HTTP requests while the GPU is busy. The EngineCoreRequest is sent over ZMQ and queued for the next scheduling iteration. This is exactly why the two-process architecture exists.
Q5: What does OutputProcessor do when it receives a token that represents an incomplete UTF-8 character?
Answer
It buffers the incomplete bytes until a valid character is formed. Many tokenizers use byte-level BPE, where tokens can split in the middle of multi-byte UTF-8 characters (e.g., emoji, CJK characters). The OutputProcessor accumulates bytes and only emits text when complete characters are available. This prevents garbled output in streaming responses.
Q6: True or false: In in-process mode (using the LLM class), EngineCoreRequest is still created even though it doesn't need to be serialized.
Answer
True. The InputProcessor always creates an EngineCoreRequest regardless of execution mode. In in-process mode, the request is passed directly to the engine core without serialization. The EngineCoreRequest type serves as a clean interface contract between the engine layer and the core, even when no process boundary exists.
Q7: What is the purpose of arrival_time in EngineCoreRequest?
Answer
It records when the request was submitted, enabling scheduling policies like FCFS (first-come-first-served). The scheduler can use arrival_time to prioritize older requests over newer ones. It's also used for metrics: you can measure end-to-end latency by comparing arrival_time with the completion time. Without it, the scheduler would have no notion of fairness or request ordering.
Q8: Why does LLMEngine.from_engine_args() exist as a classmethod factory instead of putting all the logic in __init__?
Answer
To separate argument parsing from construction. The factory method converts user-friendly EngineArgs (flat key-value pairs) into a structured VllmConfig (nested, validated configuration), selects the right executor class, and then calls __init__. This keeps __init__ simple — it receives fully validated, structured objects. It also allows alternative construction paths (e.g., creating LLMEngine directly with a VllmConfig for testing).
Summary
- LLMEngine is the orchestrator that connects the user-facing API to the engine core, with three sub-components: InputProcessor, EngineCoreClient, and OutputProcessor
- InputProcessor normalizes various input formats (strings, token IDs, multimodal data) into EngineCoreRequest — the standard wire format
- EngineCoreRequest uses msgspec.Struct for fast serialization, enabling efficient multiprocess communication via ZMQ
- EngineCoreClient abstracts the communication mode: in-process for offline use, multiprocess (ZMQ) for production servers
- OutputProcessor reverses the input pipeline: accumulates tokens, detokenizes, handles streaming modes (CUMULATIVE vs DELTA), and produces RequestOutput
- The two-process architecture (frontend + engine core) is critical for production: it keeps the HTTP server responsive while the GPU runs inference
- Next session: The Scheduler — how vLLM decides which requests get GPU time, the token budget system, and chunked prefill
Generated from my ai-study learning project.