Skip to content

AutoRegressive (AR) Module

1. Overview

The AutoRegressive (AR) module in vLLM-Omni handles autoregressive generation stages, primarily used for text, chain-of-thought(COT), and audio latent tokens generation stages in multi-stage models like Qwen2.5-Omni, Qwen3-Omni, BAGEL, .etc. Unlike some representative non-autoregressive generation stages (e.g., Diffusion), AR stages generate tokens sequentially, one at a time, following the standard transformer decoder pattern.

The AR module of vLLM-Omni extends vLLM's core components to support:

  • Multimodal inputs/outputs: Processing images, videos, and audio alongside text
  • Direct embedding transfer: Passing pre-computed prompt embeddings between pipeline stages via serialized payloads
  • Additional information flow: Carrying per-request metadata (tensors, lists) through the pipeline
  • Hidden state exposure: Exposing per-request hidden representations for downstream stages
  • Basic generator support: Support some basic heterogeneous architecture such as Convolution, LSTM, etc.

As shown in the end2end example, AR module can be widely applied across multiple stages, generating text tokens in thinker(AR), audio latent tokens in talker(AR) and audio wave in code2wav(Convolution).

2. Relationship with vLLM

The AR module builds upon vLLM main framework through inheritance, extending core classes while preserving compatibility with vLLM's scheduling, batching, KV cache management, and execution mechanisms.

Inheritance Hierarchy

  • Scheduler

classDiagram
    class VLLMScheduler {
        +schedule() SchedulerOutput
        +update_from_output() EngineCoreOutputs
    }
    class OmniARScheduler {
        +schedule() SchedulerOutput
    }
    class OmniGenerationScheduler {
        +schedule() SchedulerOutput
        +update_from_output() EngineCoreOutputs
    }
    VLLMScheduler <|-- OmniARScheduler
    VLLMScheduler <|-- OmniGenerationScheduler
- Worker

classDiagram
    class GPUWorker {
        +init_device()
        +model_runner
    }
    class GPUARWorker {
        +init_device()
    }
    class GPUGenerationWorker {
        +init_device()
    }
    GPUWorker <|-- GPUARWorker
    GPUWorker <|-- GPUGenerationWorker
- ModelRunner

classDiagram
    class GPUModelRunner {
        +execute_model()
        +sample_tokens()
    }
    class OmniGPUModelRunner {
        +_update_states()
        +_preprocess()
        +_model_forward()
    }
    class GPUARModelRunner {
        +execute_model()
        +sample_tokens()
    }
    class GPUGenerationModelRunner {
        +execute_model()
    }
    GPUModelRunner <|-- OmniGPUModelRunner
    OmniGPUModelRunner <|-- GPUARModelRunner
    OmniGPUModelRunner <|-- GPUGenerationModelRunner
- InputProcessor/OutputProcessor

classDiagram
    class InputProcessor {
        +process_inputs() EngineCoreRequest
    }

    class VLLMOutputProcessor {
        +process_outputs() OutputProcessorOutput
    }
    class MultimodalOutputProcessor {
        +process_outputs() OutputProcessorOutput
        +_route_and_normalize()
    }
    VLLMOutputProcessor <|-- MultimodalOutputProcessor

Key Extensions

  • Scheduler: OmniARScheduler extends vllm.v1.core.sched.scheduler.Scheduler to enrich scheduled requests with omni-specific payloads
  • Worker: GPUARWorker extends vllm.v1.worker.gpu_worker.Worker to initialize AR-specific model runners
  • ModelRunner: GPUARModelRunner extends OmniGPUModelRunnervllm.v1.worker.gpu_model_runner.GPUModelRunner to expose hidden states and handle multimodal outputs
  • InputProcessor: Stage-0 uses upstream vllm.v1.engine.input_processor.InputProcessor; AsyncOmniEngine then restores omni-specific payloads (for example additional_information and prompt_embeds) when building OmniEngineCoreRequest
  • OutputProcessor: MultimodalOutputProcessor extends vllm.v1.engine.output_processor.OutputProcessor to route and accumulate multimodal outputs

3. Scheduler Design

The AR module provides two scheduler implementations: one for standard autoregressive generation and one for basic heterogeneous architectures.

Request Flow

The following diagram illustrates the request flow through the AR module components:

flowchart TD
    A[InputProcessor stage-0 in AsyncOmniEngine] -->|EngineCoreRequest then upgraded to OmniEngineCoreRequest| B[OmniARScheduler]
    B -->|schedule: OmniNewRequestData| C[GPUARWorker]
    C -->|SchedulerOutput| D[GPUARModelRunner]
    D -->|execute_model: None| E[Model Forward Pass]
    E -->|hidden_states, logits| D
    D -->|sample_tokens: OmniModelRunnerOutput| F[OmniARScheduler]
    F -->|update_from_output| G[MultimodalOutputProcessor]
    G -->|RequestOutput| H[Client/Downstream Stage]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#e8f5e9
    style D fill:#f3e5f5
    style G fill:#fce4ec

The flow follows vLLM's standard pattern: input processing → scheduling → worker execution → output processing, with omni-specific enrichments at each stage.

OmniARScheduler

OmniARScheduler extends the base vLLM scheduler with minimal modifications, focusing on enriching scheduled requests with omni-specific payloads.

Modified API: schedule()

The scheduler wraps base NewRequestData entries with OmniNewRequestData to include prompt embeddings and additional information:

def schedule(self) -> SchedulerOutput:
    scheduler_output = super().schedule()
    # Rewrap base NewRequestData entries with OmniNewRequestData
    new_list = []
    for nr in scheduler_output.scheduled_new_reqs:
        request = self.requests.get(nr.req_id)
        omni_nr = OmniNewRequestData(
            req_id=nr.req_id,
            prompt_token_ids=nr.prompt_token_ids,
            # ... other base fields ...
            prompt_embeds=getattr(request, "prompt_embeds", None),
            additional_information=getattr(request, "additional_information", None),
        )
        new_list.append(omni_nr)
    scheduler_output.scheduled_new_reqs = new_list
    return scheduler_output

The update_from_output() method remains unchanged, inheriting standard request lifecycle management from the base scheduler.

OmniGenerationScheduler

OmniGenerationScheduler implements a fast-path scheduling strategy for basic heterogeneous architectures that process all input tokens in a single step.

Modified API: schedule()

Allocates all input tokens for a request at once (or 1 placeholder if zero), falling back to default scheduling if budget is insufficient:

def schedule(self) -> SchedulerOutput:
    # Fast path: allocate all input tokens at once
    while self.waiting and token_budget > 0:
        request = self.waiting.peek_request()
        required_tokens = max(getattr(request, "num_prompt_tokens", 0), 1)
        if required_tokens > token_budget:
            break  # Fall back to default scheduling
        # Allocate and schedule...

Modified API: update_from_output()

Marks requests as finished immediately after one step, since generation models complete in a single forward pass:

def update_from_output(self, ...) -> dict[int, EngineCoreOutputs]:
    # ...
    # Diffusion request: completes in one step
    request.status = RequestStatus.FINISHED_STOPPED
    kv_transfer_params = self._free_request(request)
    # ...

4. Worker and ModelRunner Design

GPUARWorker

GPUARWorker initializes the AR-specific model runner while maintaining standard device initialization:

class GPUARWorker(GPUWorker):
    def init_device(self):
        # ... standard device initialization ...
        self.model_runner = GPUARModelRunner(self.vllm_config, self.device)

GPUARModelRunner

GPUARModelRunner follows vLLM's two-phase execute/sample flow while exposing hidden states and multimodal outputs.

Two-Phase Execution

Phase 1: execute_model() - Runs forward pass and stores state: - Computes logits from hidden states - Stores ExecuteModelState with hidden states, logits, and multimodal outputs - Returns None to defer sampling

Phase 2: sample_tokens() - Samples tokens and builds output: - Retrieves stored state from execute_model() - Samples tokens using logits - Extracts per-request hidden states and multimodal outputs - Builds OmniModelRunnerOutput with pooler_output containing hidden states

def sample_tokens(self, grammar_output) -> OmniModelRunnerOutput:
    # Retrieve stored state
    hidden_states, multimodal_outputs = self.execute_model_state

    # Sample tokens
    sampler_output = self._sample(logits, spec_decode_metadata)

    # Extract per-request hidden states
    pooler_output = []
    for rid in req_ids:
        hidden_slice = hidden_states_cpu[start:end]
        payload = {"hidden": hidden_slice}
        # Add multimodal outputs if present
        pooler_output.append(payload)

    return OmniModelRunnerOutput(
        pooler_output=pooler_output,
        # ... other fields ...
    )

GPUGenerationModelRunner

GPUGenerationModelRunner implements a simplified single-phase execution for basic heterogeneous architectures:

  • No logits computation or token sampling
  • Direct generation from forward pass in model implementation
  • Returns outputs via pooler_output immediately after forward pass

OmniGPUModelRunner

OmniGPUModelRunner provides shared functionality for both AR and Generation runners:

Prompt Embeddings Overlay

During prefill, overlays custom prompt_embeds from request state onto inputs_embeds:

def _collect_additional_information_for_prefill(self, num_scheduled_tokens_np):
    for req_index, req_id in enumerate(self.input_batch.req_ids):
        req_state = self.requests[req_id]
        pe_cpu = getattr(req_state, "prompt_embeds_cpu", None)
        # Overlay prompt_embeds for prefill portion
        if pe_cpu is not None:
            src = pe_cpu[num_computed_tokens:num_computed_tokens + overlay_len]
            self.inputs_embeds[start_offset:start_offset + overlay_len].copy_(src)

Additional Information Processing

Decodes and manages additional_information payloads: - Decodes serialized payloads → CPU tensors in request state - Passes runtime information to model via runtime_additional_information kwarg - Processes model-provided updates via postprocess() hook - Merges updates back into request state

M-RoPE Position Initialization

For multimodal models using M-RoPE (e.g., Qwen2-VL), computes position encodings from multimodal feature metadata (image grids, video grids, audio features).

5. Input/Output Processing

Processing Pipeline

The input/output processing pipeline handles serialization, routing, and accumulation of multimodal data:

sequenceDiagram
    participant Client
    participant AsyncOmniEngine
    participant InputProcessor
    participant Scheduler
    participant ModelRunner
    participant MultimodalOutputProcessor
    participant Client

    Client->>AsyncOmniEngine: prompt + prompt_embeds + additional_info
    AsyncOmniEngine->>InputProcessor: process_inputs()
    InputProcessor->>Scheduler: EngineCoreRequest
    AsyncOmniEngine->>AsyncOmniEngine: _upgrade_to_omni_request() + serialize_additional_information()
    Scheduler->>ModelRunner: OmniNewRequestData (with payloads)
    ModelRunner->>ModelRunner: Decode payloads → CPU tensors
    ModelRunner->>ModelRunner: Overlay prompt_embeds on inputs_embeds
    ModelRunner->>ModelRunner: Forward pass with runtime_additional_information
    ModelRunner->>ModelRunner: Extract hidden states + multimodal outputs
    ModelRunner->>MultimodalOutputProcessor: OmniModelRunnerOutput (pooler_output)
    MultimodalOutputProcessor->>MultimodalOutputProcessor: Route by output_type
    MultimodalOutputProcessor->>MultimodalOutputProcessor: Accumulate tensors in OmniRequestState
    MultimodalOutputProcessor->>MultimodalOutputProcessor: Consolidate tensor lists
    MultimodalOutputProcessor->>Client: RequestOutput (with multimodal_output)

Stage-0 Input Processing

Stage-0 now uses upstream InputProcessor directly, and AsyncOmniEngine upgrades the request to OmniEngineCoreRequest while restoring omni-specific payloads.

request = self.input_processor.process_inputs(
    request_id=request_id,
    prompt=prompt,
    params=params,
    supported_tasks=self.supported_tasks,
)
request = _upgrade_to_omni_request(request, prompt)

MultimodalOutputProcessor

MultimodalOutputProcessor routes outputs by modality type and accumulates multimodal tensors.

Output Routing

Routes EngineCoreOutput by output_type attribute: - "text": Standard text generation path - "image", "audio", "latents": Extract from pooling_output or multimodal_outputs - Fallback: Heuristic based on presence of pooling_output

Tensor Accumulation

OmniRequestState accumulates multimodal tensors across multiple steps:

def add_multimodal_tensor(self, payload, mm_type):
    # Normalize payload to dict
    incoming = {mm_type or "hidden": payload}

    # Accumulate: convert tensors to lists for deferred concatenation
    if isinstance(v, torch.Tensor) and isinstance(existing, torch.Tensor):
        self.mm_accumulated[k] = [existing, v]  # List accumulation

Before final output, consolidates tensor lists via concatenation:

def _consolidate_multimodal_tensors(self):
    for k, v in self.mm_accumulated.items():
        if isinstance(v, list) and isinstance(v[0], torch.Tensor):
            self.mm_accumulated[k] = torch.cat(v, dim=0)  # Concatenate

The consolidated tensors are attached to RequestOutput.multimodal_output for consumption by downstream stages or clients.

6. Summary

The AR module of vLLM-Omni extends vLLM through strategic inheritance and minimal API modifications:

Key Design Patterns

  1. Inheritance over composition: Extends vLLM classes to preserve compatibility with existing scheduling, batching, and execution mechanisms
  2. Payload serialization: Uses serialized additional_information payloads together with prompt-embedding handoff for efficient inter-stage transfer
  3. Two-phase execution: Maintains vLLM's execute/sample separation for AR models while supporting single-phase execution for generation models
  4. Multimodal routing: Routes outputs by output_type and accumulates tensors incrementally to support streaming

Differences from vLLM

  • Payload support: Serialized additional information and prompt embeddings enable direct transfer between pipeline stages
  • Multimodal handling: Extended input/output processors support images, audio, and other modalities alongside text
  • Hidden state exposure: AR model runners expose per-request hidden states via pooler_output for downstream consumption
  • Generation scheduler: Fast-path scheduling for basic heterogeneous architectures that complete in one step

The AR module seamlessly integrates with vLLM's existing infrastructure while adding the necessary extensions for multi-stage, multimodal generation pipelines.