AutoRegressive (AR) Module¶
1. Overview¶
The AutoRegressive (AR) module in vLLM-Omni handles autoregressive generation stages, primarily used for text, chain-of-thought(COT), and audio latent tokens generation stages in multi-stage models like Qwen2.5-Omni, Qwen3-Omni, BAGEL, .etc. Unlike some representative non-autoregressive generation stages (e.g., Diffusion), AR stages generate tokens sequentially, one at a time, following the standard transformer decoder pattern.
The AR module of vLLM-Omni extends vLLM's core components to support:
- Multimodal inputs/outputs: Processing images, videos, and audio alongside text
- Direct embedding transfer: Passing pre-computed prompt embeddings between pipeline stages via serialized payloads
- Additional information flow: Carrying per-request metadata (tensors, lists) through the pipeline
- Hidden state exposure: Exposing per-request hidden representations for downstream stages
- Basic generator support: Support some basic heterogeneous architecture such as Convolution, LSTM, etc.
As shown in the end2end example, AR module can be widely applied across multiple stages, generating text tokens in thinker(AR), audio latent tokens in talker(AR) and audio wave in code2wav(Convolution).
2. Relationship with vLLM¶
The AR module builds upon vLLM main framework through inheritance, extending core classes while preserving compatibility with vLLM's scheduling, batching, KV cache management, and execution mechanisms.
Inheritance Hierarchy¶
- Scheduler
classDiagram
class VLLMScheduler {
+schedule() SchedulerOutput
+update_from_output() EngineCoreOutputs
}
class OmniARScheduler {
+schedule() SchedulerOutput
}
class OmniGenerationScheduler {
+schedule() SchedulerOutput
+update_from_output() EngineCoreOutputs
}
VLLMScheduler <|-- OmniARScheduler
VLLMScheduler <|-- OmniGenerationScheduler - Worker classDiagram
class GPUWorker {
+init_device()
+model_runner
}
class GPUARWorker {
+init_device()
}
class GPUGenerationWorker {
+init_device()
}
GPUWorker <|-- GPUARWorker
GPUWorker <|-- GPUGenerationWorker - ModelRunner classDiagram
class GPUModelRunner {
+execute_model()
+sample_tokens()
}
class OmniGPUModelRunner {
+_update_states()
+_preprocess()
+_model_forward()
}
class GPUARModelRunner {
+execute_model()
+sample_tokens()
}
class GPUGenerationModelRunner {
+execute_model()
}
GPUModelRunner <|-- OmniGPUModelRunner
OmniGPUModelRunner <|-- GPUARModelRunner
OmniGPUModelRunner <|-- GPUGenerationModelRunner - InputProcessor/OutputProcessor classDiagram
class InputProcessor {
+process_inputs() EngineCoreRequest
}
class VLLMOutputProcessor {
+process_outputs() OutputProcessorOutput
}
class MultimodalOutputProcessor {
+process_outputs() OutputProcessorOutput
+_route_and_normalize()
}
VLLMOutputProcessor <|-- MultimodalOutputProcessor Key Extensions¶
- Scheduler:
OmniARSchedulerextendsvllm.v1.core.sched.scheduler.Schedulerto enrich scheduled requests with omni-specific payloads - Worker:
GPUARWorkerextendsvllm.v1.worker.gpu_worker.Workerto initialize AR-specific model runners - ModelRunner:
GPUARModelRunnerextendsOmniGPUModelRunner→vllm.v1.worker.gpu_model_runner.GPUModelRunnerto expose hidden states and handle multimodal outputs - InputProcessor: Stage-0 uses upstream
vllm.v1.engine.input_processor.InputProcessor;AsyncOmniEnginethen restores omni-specific payloads (for exampleadditional_informationandprompt_embeds) when buildingOmniEngineCoreRequest - OutputProcessor:
MultimodalOutputProcessorextendsvllm.v1.engine.output_processor.OutputProcessorto route and accumulate multimodal outputs
3. Scheduler Design¶
The AR module provides two scheduler implementations: one for standard autoregressive generation and one for basic heterogeneous architectures.
Request Flow¶
The following diagram illustrates the request flow through the AR module components:
flowchart TD
A[InputProcessor stage-0 in AsyncOmniEngine] -->|EngineCoreRequest then upgraded to OmniEngineCoreRequest| B[OmniARScheduler]
B -->|schedule: OmniNewRequestData| C[GPUARWorker]
C -->|SchedulerOutput| D[GPUARModelRunner]
D -->|execute_model: None| E[Model Forward Pass]
E -->|hidden_states, logits| D
D -->|sample_tokens: OmniModelRunnerOutput| F[OmniARScheduler]
F -->|update_from_output| G[MultimodalOutputProcessor]
G -->|RequestOutput| H[Client/Downstream Stage]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#e8f5e9
style D fill:#f3e5f5
style G fill:#fce4ec The flow follows vLLM's standard pattern: input processing → scheduling → worker execution → output processing, with omni-specific enrichments at each stage.
OmniARScheduler¶
OmniARScheduler extends the base vLLM scheduler with minimal modifications, focusing on enriching scheduled requests with omni-specific payloads.
Modified API: schedule()¶
The scheduler wraps base NewRequestData entries with OmniNewRequestData to include prompt embeddings and additional information:
def schedule(self) -> SchedulerOutput:
scheduler_output = super().schedule()
# Rewrap base NewRequestData entries with OmniNewRequestData
new_list = []
for nr in scheduler_output.scheduled_new_reqs:
request = self.requests.get(nr.req_id)
omni_nr = OmniNewRequestData(
req_id=nr.req_id,
prompt_token_ids=nr.prompt_token_ids,
# ... other base fields ...
prompt_embeds=getattr(request, "prompt_embeds", None),
additional_information=getattr(request, "additional_information", None),
)
new_list.append(omni_nr)
scheduler_output.scheduled_new_reqs = new_list
return scheduler_output
The update_from_output() method remains unchanged, inheriting standard request lifecycle management from the base scheduler.
OmniGenerationScheduler¶
OmniGenerationScheduler implements a fast-path scheduling strategy for basic heterogeneous architectures that process all input tokens in a single step.
Modified API: schedule()¶
Allocates all input tokens for a request at once (or 1 placeholder if zero), falling back to default scheduling if budget is insufficient:
def schedule(self) -> SchedulerOutput:
# Fast path: allocate all input tokens at once
while self.waiting and token_budget > 0:
request = self.waiting.peek_request()
required_tokens = max(getattr(request, "num_prompt_tokens", 0), 1)
if required_tokens > token_budget:
break # Fall back to default scheduling
# Allocate and schedule...
Modified API: update_from_output()¶
Marks requests as finished immediately after one step, since generation models complete in a single forward pass:
def update_from_output(self, ...) -> dict[int, EngineCoreOutputs]:
# ...
# Diffusion request: completes in one step
request.status = RequestStatus.FINISHED_STOPPED
kv_transfer_params = self._free_request(request)
# ...
4. Worker and ModelRunner Design¶
GPUARWorker¶
GPUARWorker initializes the AR-specific model runner while maintaining standard device initialization:
class GPUARWorker(GPUWorker):
def init_device(self):
# ... standard device initialization ...
self.model_runner = GPUARModelRunner(self.vllm_config, self.device)
GPUARModelRunner¶
GPUARModelRunner follows vLLM's two-phase execute/sample flow while exposing hidden states and multimodal outputs.
Two-Phase Execution¶
Phase 1: execute_model() - Runs forward pass and stores state: - Computes logits from hidden states - Stores ExecuteModelState with hidden states, logits, and multimodal outputs - Returns None to defer sampling
Phase 2: sample_tokens() - Samples tokens and builds output: - Retrieves stored state from execute_model() - Samples tokens using logits - Extracts per-request hidden states and multimodal outputs - Builds OmniModelRunnerOutput with pooler_output containing hidden states
def sample_tokens(self, grammar_output) -> OmniModelRunnerOutput:
# Retrieve stored state
hidden_states, multimodal_outputs = self.execute_model_state
# Sample tokens
sampler_output = self._sample(logits, spec_decode_metadata)
# Extract per-request hidden states
pooler_output = []
for rid in req_ids:
hidden_slice = hidden_states_cpu[start:end]
payload = {"hidden": hidden_slice}
# Add multimodal outputs if present
pooler_output.append(payload)
return OmniModelRunnerOutput(
pooler_output=pooler_output,
# ... other fields ...
)
GPUGenerationModelRunner¶
GPUGenerationModelRunner implements a simplified single-phase execution for basic heterogeneous architectures:
- No logits computation or token sampling
- Direct generation from forward pass in model implementation
- Returns outputs via
pooler_outputimmediately after forward pass
OmniGPUModelRunner¶
OmniGPUModelRunner provides shared functionality for both AR and Generation runners:
Prompt Embeddings Overlay¶
During prefill, overlays custom prompt_embeds from request state onto inputs_embeds:
def _collect_additional_information_for_prefill(self, num_scheduled_tokens_np):
for req_index, req_id in enumerate(self.input_batch.req_ids):
req_state = self.requests[req_id]
pe_cpu = getattr(req_state, "prompt_embeds_cpu", None)
# Overlay prompt_embeds for prefill portion
if pe_cpu is not None:
src = pe_cpu[num_computed_tokens:num_computed_tokens + overlay_len]
self.inputs_embeds[start_offset:start_offset + overlay_len].copy_(src)
Additional Information Processing¶
Decodes and manages additional_information payloads: - Decodes serialized payloads → CPU tensors in request state - Passes runtime information to model via runtime_additional_information kwarg - Processes model-provided updates via postprocess() hook - Merges updates back into request state
M-RoPE Position Initialization¶
For multimodal models using M-RoPE (e.g., Qwen2-VL), computes position encodings from multimodal feature metadata (image grids, video grids, audio features).
5. Input/Output Processing¶
Processing Pipeline¶
The input/output processing pipeline handles serialization, routing, and accumulation of multimodal data:
sequenceDiagram
participant Client
participant AsyncOmniEngine
participant InputProcessor
participant Scheduler
participant ModelRunner
participant MultimodalOutputProcessor
participant Client
Client->>AsyncOmniEngine: prompt + prompt_embeds + additional_info
AsyncOmniEngine->>InputProcessor: process_inputs()
InputProcessor->>Scheduler: EngineCoreRequest
AsyncOmniEngine->>AsyncOmniEngine: _upgrade_to_omni_request() + serialize_additional_information()
Scheduler->>ModelRunner: OmniNewRequestData (with payloads)
ModelRunner->>ModelRunner: Decode payloads → CPU tensors
ModelRunner->>ModelRunner: Overlay prompt_embeds on inputs_embeds
ModelRunner->>ModelRunner: Forward pass with runtime_additional_information
ModelRunner->>ModelRunner: Extract hidden states + multimodal outputs
ModelRunner->>MultimodalOutputProcessor: OmniModelRunnerOutput (pooler_output)
MultimodalOutputProcessor->>MultimodalOutputProcessor: Route by output_type
MultimodalOutputProcessor->>MultimodalOutputProcessor: Accumulate tensors in OmniRequestState
MultimodalOutputProcessor->>MultimodalOutputProcessor: Consolidate tensor lists
MultimodalOutputProcessor->>Client: RequestOutput (with multimodal_output) Stage-0 Input Processing¶
Stage-0 now uses upstream InputProcessor directly, and AsyncOmniEngine upgrades the request to OmniEngineCoreRequest while restoring omni-specific payloads.
request = self.input_processor.process_inputs(
request_id=request_id,
prompt=prompt,
params=params,
supported_tasks=self.supported_tasks,
)
request = _upgrade_to_omni_request(request, prompt)
MultimodalOutputProcessor¶
MultimodalOutputProcessor routes outputs by modality type and accumulates multimodal tensors.
Output Routing¶
Routes EngineCoreOutput by output_type attribute: - "text": Standard text generation path - "image", "audio", "latents": Extract from pooling_output or multimodal_outputs - Fallback: Heuristic based on presence of pooling_output
Tensor Accumulation¶
OmniRequestState accumulates multimodal tensors across multiple steps:
def add_multimodal_tensor(self, payload, mm_type):
# Normalize payload to dict
incoming = {mm_type or "hidden": payload}
# Accumulate: convert tensors to lists for deferred concatenation
if isinstance(v, torch.Tensor) and isinstance(existing, torch.Tensor):
self.mm_accumulated[k] = [existing, v] # List accumulation
Before final output, consolidates tensor lists via concatenation:
def _consolidate_multimodal_tensors(self):
for k, v in self.mm_accumulated.items():
if isinstance(v, list) and isinstance(v[0], torch.Tensor):
self.mm_accumulated[k] = torch.cat(v, dim=0) # Concatenate
The consolidated tensors are attached to RequestOutput.multimodal_output for consumption by downstream stages or clients.
6. Summary¶
The AR module of vLLM-Omni extends vLLM through strategic inheritance and minimal API modifications:
Key Design Patterns¶
- Inheritance over composition: Extends vLLM classes to preserve compatibility with existing scheduling, batching, and execution mechanisms
- Payload serialization: Uses serialized
additional_informationpayloads together with prompt-embedding handoff for efficient inter-stage transfer - Two-phase execution: Maintains vLLM's execute/sample separation for AR models while supporting single-phase execution for generation models
- Multimodal routing: Routes outputs by
output_typeand accumulates tensors incrementally to support streaming
Differences from vLLM¶
- Payload support: Serialized additional information and prompt embeddings enable direct transfer between pipeline stages
- Multimodal handling: Extended input/output processors support images, audio, and other modalities alongside text
- Hidden state exposure: AR model runners expose per-request hidden states via
pooler_outputfor downstream consumption - Generation scheduler: Fast-path scheduling for basic heterogeneous architectures that complete in one step
The AR module seamlessly integrates with vLLM's existing infrastructure while adding the necessary extensions for multi-stage, multimodal generation pipelines.