Stage configs for vLLM-Omni¶
In vLLM-Omni, the target model is separated into multiple stages, which are processed by different LLMEngines, DiffusionEngines or other types of engines. Depending on different types of stages, such as Autoregressive (AR) stage or Diffusion transformer (DiT) stage, each can choose corresponding schedulers, model workers to load with the Engines in a plug-in fashion.
Note
Default deploy config YAMLs (for example, vllm_omni/deploy/qwen2_5_omni.yaml, vllm_omni/deploy/qwen3_omni_moe.yaml, and vllm_omni/deploy/qwen3_tts.yaml) are bundled and loaded automatically when neither --stage-configs-path nor --deploy-config is provided — the model registry resolves the right pipeline + deploy YAML by model_type. The bundled defaults have been verified on 1xH100 for Qwen2.5-Omni and 2xH100 for Qwen3-Omni. Models that have not yet migrated to the new schema continue to use the legacy vllm_omni/model_executor/stage_configs/<model>.yaml files via --stage-configs-path.
New deploy schema reference¶
The new deploy schema lives under vllm_omni/deploy/ and is paired with a frozen PipelineConfig registered by the model's pipeline.py. Each deploy YAML has these top-level fields:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
base_config | str (path) | optional | — | Overlay parent (relative or absolute). stages: / platforms: deep-merged by stage_id; other scalars overlay-wins. Intended for user-authored overlays; prod yamls stay flat. |
async_chunk | bool | optional | true | Enable chunked streaming between stages. Pin to false if the pipeline runs end-to-end. |
connectors | dict | optional | null | Named connector specs ({name, extra}). Referenced by each stage's input_connectors / output_connectors. See Connector schema. |
edges | list | optional | null | Explicit edge list for the KV transfer graph. Auto-derived from stage inputs if omitted. |
stages | list | required | — | Per-stage engine args + wiring (see Stage fields). |
platforms | dict | optional | null | Keyed by npu / rocm / xpu, each contains a stages: list with per-platform overrides applied on top of the CUDA defaults. |
pipeline | str | optional | null | Override the auto-detected pipeline registry key (used for structural variants like qwen2_5_omni_thinker_only). |
trust_remote_code | bool | optional | true | Pipeline-wide. Trust HF remote code on model load; applies to every stage. |
distributed_executor_backend | str | null | optional | null | Pipeline-wide. Distributed executor backend forwarded to vLLM ("mp", "ray", "external_launcher"). If omitted, vLLM auto-selects backend from runtime topology. |
dtype | str | null | optional | null | Pipeline-wide. Model dtype for every stage. |
quantization | str | null | optional | null | Pipeline-wide. Quantization method for every stage. |
enable_prefix_caching | bool | optional | false | Pipeline-wide. Prefix cache toggle applied to every stage. |
enable_chunked_prefill | bool | null | optional | null | Pipeline-wide. Chunked prefill toggle applied to every stage. |
data_parallel_size | int | optional | 1 | Pipeline-wide. DP degree for every stage. |
pipeline_parallel_size | int | optional | 1 | Pipeline-wide. PP degree for every stage. |
Note: for diffusion path, distributed_executor_backend currently defaults to mp, and ray / external_launcher are not fully supported yet.
Stage fields¶
Each entry under stages: accepts any StageDeployConfig field directly (no nested engine_args:). Only fields whose value legitimately varies across stages live here; pipeline-wide settings (trust_remote_code, distributed_executor_backend, dtype, quantization, prefix/chunked prefill, DP/PP sizes) are declared at the top level and applied to every stage. Unknown keys fall through to engine_extras: and are forwarded to the engine.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
stage_id | int | required | — | Stage identity; matched against PipelineConfig.stages[*].stage_id. |
max_num_seqs | int | optional | 64 | Max concurrent sequences per stage. |
gpu_memory_utilization | float | optional | 0.9 | Per-stage memory budget. |
tensor_parallel_size | int | optional | 1 | TP degree for this stage. |
enforce_eager | bool | optional | false | Disable CUDA graphs. |
max_num_batched_tokens | int | optional | 32768 | Prefill budget. |
max_model_len | int | null | optional | null | Per-stage context length (auto-sets VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 when larger than HF default). |
async_scheduling | bool | null | optional | null | Per-stage async scheduling toggle. |
devices | str | optional | "0" | CUDA_VISIBLE_DEVICES-style device list. |
output_connectors | dict | null | optional | null | Keyed by to_stage_<n>; values are names registered under top-level connectors:. |
input_connectors | dict | null | optional | null | Keyed by from_stage_<n>; values are names registered under top-level connectors:. |
default_sampling_params | dict | null | optional | null | Baseline sampling params. Deep-merged with pipeline sampling_constraints (pipeline wins). |
engine_extras | dict | optional | {} | Catch-all for keys not listed above; deep-merged across overlays. Also carries per-stage overrides of pipeline-wide settings (e.g. stage-specific dtype). |
Connector schema¶
Each entry under top-level connectors: follows this shape:
connectors:
<connector_name>:
name: <ConnectorClassName> # required — class registered in vllm_omni.distributed
extra: # optional — forwarded to the connector's __init__
<key>: <value>
...
| Connector class | Use case | extra keys |
|---|---|---|
SharedMemoryConnector | Same-host KV transfer between stages (default for bundled YAMLs). | shm_threshold_bytes (int, default 65536). |
MooncakeStoreConnector | Cross-host KV transfer over TCP. Required for multi-node deployments. | host, metadata_server, master, segment (int bytes), localbuf (int bytes), proto ("tcp" / "rdma"). |
A stage references a connector by name in its input_connectors / output_connectors:
connectors:
shm:
name: SharedMemoryConnector
stages:
- stage_id: 0
output_connectors: {to_stage_1: shm}
- stage_id: 1
input_connectors: {from_stage_0: shm}
CLI flags introduced in this refactor¶
| Flag | Description |
|---|---|
--deploy-config PATH | Load a new-schema deploy YAML. Takes precedence over --stage-configs-path. Optional — when omitted, the bundled vllm_omni/deploy/<model_type>.yaml is auto-loaded by the model registry. |
--stage-overrides JSON | Per-stage JSON overrides, e.g. '{"0":{"gpu_memory_utilization":0.5}}'. Per-stage values always win over global flags. |
--async-chunk / --no-async-chunk | Flip the deploy YAML's async_chunk: bool. Unset (default) leaves the YAML value in force. |
--stage-configs-path | Deprecated. Accepts legacy stage_args yamls and (auto-detected) new deploy yamls; emits a deprecation warning. Migrate to --deploy-config. To be removed in a follow-up PR. |
Stage-Based CLI Paradigm¶
The stage-based CLI paradigm facilitates the execution of discrete pipeline stages within isolated processes:
- Stage 0 typically encapsulates the orchestrator and the primary API server. Invocation requires
--stage-id 0,--omni-master-address,--omni-master-port, and standard port declarations (e.g.,--port). - Worker Stages operate without a distinct API server (i.e., using
--headless), are assigned sequential--stage-ididentifiers, and must reference the corresponding--omni-master-addressand--omni-master-portparameters to successfully register with Stage 0.
For migrated architectures, the system automatically resolves and loads the bundled deployment YAML. Consequently, the primary execution path does not necessitate the explicit definition of --deploy-config: the example below uses CUDA_VISIBLE_DEVICES=0 for Stage 0 and CUDA_VISIBLE_DEVICES=1 for Stage 1.
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--port 8091 \
--stage-id 0 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
When instantiating a custom deployment YAML conforming to the updated schema, append the --deploy-config /path/to/override.yaml directive to all node invocations. For legacy architectures (e.g., BAGEL) configured via deprecated stage_args: schemas, continue to specify the relevant configuration via --stage-configs-path /path/to/config.yaml.
In the context of standard initialization architectures, utilizing the --stage-overrides parameter operates as the optimal methodology for delineating stage-specific tuning from the CLI interface:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
Conversely, in the context of the stage-based CLI paradigm, given that each execution process exclusively instantiates a single pipeline stage, configuration override attributes can be defined uniformly via explicit CLI flags on the corresponding instantiation command, rendering composite --stage-overrides JSON strings unnecessary:
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--gpu-memory-utilization 0.5 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
Precedence¶
From highest to lowest:
- Per-stage flags (
--stage-overridesJSON,--stage-<id>-<key>if registered) - Explicit global CLI flags (
--gpu-memory-utilization 0.85, etc.) - Platform section (
platforms.npu.stages, etc.) on top of the basestages: - Overlay YAML (via
base_config:) on top of the base YAML - Parser defaults
Worked override example¶
Starting from the bundled vllm_omni/deploy/qwen3_omni_moe.yaml:
# vllm_omni/deploy/qwen3_omni_moe.yaml (excerpt)
async_chunk: true
stages:
- stage_id: 0
gpu_memory_utilization: 0.9
max_num_seqs: 32
- stage_id: 1
gpu_memory_utilization: 0.7
max_num_seqs: 16
A user-authored overlay that inherits the base and overrides only stage 1:
# my_overrides.yaml
base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml
stages:
- stage_id: 1
gpu_memory_utilization: 0.5 # smaller GPU
Launched with both an explicit global flag and a per-stage override:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config my_overrides.yaml \
--max-model-len 16384 \
--stage-overrides '{"0": {"max_num_seqs": 8}}'
Within the stage-based CLI paradigm, equivalent configuration parameters can inherently be passed directly as command-line arguments to the designated single-stage process instantiation:
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 0 \
--max-num-seqs 8 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
Effective config per stage after the merge:
| Stage | Field | Final value | Source |
|---|---|---|---|
| 0 | gpu_memory_utilization | 0.9 | base YAML (overlay didn't touch stage 0) |
| 0 | max_num_seqs | 8 | per-stage CLI (--stage-overrides) — wins over base 32 |
| 0 | max_model_len | 16384 | global CLI |
| 1 | gpu_memory_utilization | 0.5 | overlay YAML — wins over base 0.7 |
| 1 | max_num_seqs | 16 | base YAML (overlay didn't touch this field) |
| 1 | max_model_len | 16384 | global CLI |
| 2 | (all defaults) | — | base YAML (no overrides apply) |
Therefore, as a core part of vLLM-Omni, the stage configs for a model have several main functions:
- Claim partition of stages and their corresponding class implementation in
model_executor/models. - The disaggregated configuration for each stage and the communication topology among them.
- Engine arguments for each engine within the stage.
- Input and output dependencies for each stage.
- Default input parameters.
To override specific parameters, explicitly inject the customized configuration schema in both online and offline instantiation flows. Prioritize the --deploy-config flag when loading the new-schema deploy YAML schemas, reserving the --stage-configs-path parameter exclusively to maintain compatibility with legacy stage_args YAML constructs.
Examples:
For offline (Assume necessary dependencies have been imported):
model_name = "Qwen/Qwen2.5-Omni-7B"
omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
For online serving:
Legacy online serving:
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
Important
We are actively iterating on the definition of stage configs, and we welcome all feedbacks from both community users and developers to help us shape the development!
Below is a specific example of stage_configs.yaml in Qwen2.5-omni.
# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
stage_args:
- stage_id: 0 # mark the unique id for each stage
runtime: # The disaggregated configuration
process: true # Run this stage in a separate process
devices: "0" # Logical device index for this stage (mapped through CUDA_VISIBLE_DEVICES / ASCEND_RT_VISIBLE_DEVICES if set)
engine_args: # Engine arguments for a certain engine
model_stage: thinker
max_num_seqs: 1
model_arch: Qwen2_5OmniForConditionalGeneration # The model implementation registered in model_executor/models/registry.py
worker_type: ar # The specific worker used
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler # The specific scehduler used
gpu_memory_utilization: 0.8 # The gpu memory allocation for the stage within a single chip
enforce_eager: true # Now we only support eager mode
trust_remote_code: true # Needed by huggingface config parsing
engine_output_type: latent # It claims that the stage will input latent hiddenstates besides token ids
enable_prefix_caching: false # For request with hiddenstates output, the prefix caching is not supported now
is_comprehension: true # If the stage is a text or multimodal comprehension module. If it is, the AsyncOmni will use its tokenizer as default
final_output: true # If the stage has output as part of final outputs. If it is false, which means that the stage only works as a intermediate role.
final_output_type: text # What is the final output type. It can be text and audio now.
default_sampling_params: # sampling parameters for the stage. Their meaning aligns with vLLM.
temperature: 0.0
top_p: 1.0
top_k: -1
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.1
- stage_id: 1
runtime:
process: true
devices: "1"
engine_args:
model_stage: talker
max_num_seqs: 3
model_arch: Qwen2_5OmniForConditionalGeneration
worker_type: ar
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
gpu_memory_utilization: 0.8
enforce_eager: true
trust_remote_code: true
enable_prefix_caching: false
engine_output_type: latent
engine_input_source: [0]
custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker
default_sampling_params:
temperature: 0.9
top_p: 0.8
top_k: 40
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.05
stop_token_ids: [8294]
- stage_id: 2
runtime:
process: true
devices: "0" # Example: use a different GPU than the previous stage; use "0" if single GPU
engine_args:
model_stage: code2wav
max_num_seqs: 1
model_arch: Qwen2_5OmniForConditionalGeneration
worker_type: generation
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
gpu_memory_utilization: 0.15
enforce_eager: true
trust_remote_code: true
enable_prefix_caching: false
engine_output_type: audio
engine_input_source: [1]
final_output: true
final_output_type: audio
default_sampling_params:
temperature: 0.0
top_p: 1.0
top_k: -1
max_tokens: 2048
seed: 42
detokenize: True
repetition_penalty: 1.1
# Top-level runtime config (concise): default windows and stage edges
runtime:
enabled: true
edges:
- from: 0 # thinker → talker: trigger only after receiving full input (-1)
to: 1
- from: 1 # talker → code2wav: trigger only after receiving full input (-1)
to: 2
Stage Configuration Arguments¶
Each stage in the stage_args list contains the following configuration options:
stage_id¶
A unique identifier for each stage in the multi-stage pipeline. Stages are numbered sequentially starting from 0, and this ID is used to reference stages in inter-stage dependencies (e.g., engine_input_source).
prompt_expand_func (Optional)¶
A custom Python function hook for the LLM stage (Stage 0) that expands a single incoming prompt object into multiple prompts. This is primarily used for multi-modal Classifier-Free Guidance (CFG), where it generates the necessary companion requests (like a negative text prompt) and tags them with internal roles (e.g., cfg_text). This ensures the upstream LLM generates the needed contextual hidden states for both the conditional and unconditional generations simultaneously.
cfg_kv_collect_func (Optional)¶
A custom Python function hook for downstream diffusion stages (Stage 1+) to collect, map, and process the KV caches transferred from the companion requests fired by prompt_expand_func. It aggregates the hidden condition states cleanly (e.g., binding them as cfg_text_past_key_values and cfg_text_kv_metadata), allowing the diffusion runtime to perform CFG smoothly without redundantly evaluating text paths on the DiT workers.
runtime¶
Configuration for disaggregated execution of the stage, controlling how the stage is deployed and executed.
runtime.process¶
Whether to run this stage in a separate process. When set to true, the stage will be executed in an isolated process, enabling better resource isolation and parallel execution across different stages. This is essential for multi-GPU deployments where different stages run on different devices.
Default: true
runtime.devices¶
Logical device indices for this stage, specified as a string. Values are logical indices (0, 1, 2, ...) — not physical GPU IDs — and are mapped through the platform's visibility env var (CUDA_VISIBLE_DEVICES on CUDA, ASCEND_RT_VISIBLE_DEVICES on NPU) before being applied via torch.cuda.set_device() (or the equivalent).
Example: if CUDA_VISIBLE_DEVICES=0,2,4 is set in the environment, then devices: "0" selects physical GPU 0 (the first visible), devices: "1" selects physical GPU 2, and devices: "0,1" makes physical GPUs 0 and 2 available to the stage. If no visibility env var is set, logical and physical IDs coincide.
Default: "0"
engine_args.max_num_seqs¶
The maximum number of sequences for concurrent processing in this stage. For LLM stages, this controls the vLLM scheduler's maximum concurrent sequences. For all stage types, this also controls how many tasks can be batched together in the task processing loop.
Default: 1
engine_args¶
Engine arguments for configuring the LLM engine, diffusion engine, or other engine types used by this stage.
engine_args.model_stage¶
The name identifier for this model stage within the multi-stage architecture. This is used internally to distinguish different stages of the same model (e.g., "thinker", "talker", "code2wav" in Qwen2.5-Omni).
engine_args.model_arch¶
The model architecture class name that is registered in model_executor/models/registry.py. This specifies which model implementation to use for this stage. The class must be registered in the model registry for vLLM-Omni to locate and instantiate it.
engine_args.worker_cls¶
The specific worker class to use for this stage. This determines how the model computations are executed. Examples include vllm_omni.worker.gpu_ar_worker.GPUARWorker for autoregressive stages and vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker for diffusion-based stages.
engine_args.scheduler_cls¶
The scheduler class to use for this stage. The scheduler manages request queuing, batching, and execution order. Examples include vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler for standard stages and vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler for diffusion stages.
engine_args.gpu_memory_utilization¶
The fraction of GPU memory to allocate for this stage within a single GPU chip. This is a value between 0.0 and 1.0, where 0.8 means 80% of the GPU memory will be used by this stage. This allows fine-grained control over memory allocation when multiple stages share the same GPU or when reserving memory for other operations.
Default: 0.8
Memory Configuration Guide
For detailed information on how to calculate memory requirements and properly configure gpu_memory_utilization, see the GPU Memory Calculation and Configuration Guide.
engine_args.enforce_eager¶
Whether to enforce eager execution mode. When set to true, the engine will run in eager mode without using CUDA graphs or other compilation optimizations. Currently, vLLM-Omni only supports eager mode.
Default: true
engine_args.trust_remote_code¶
Whether to trust remote code when loading models from Hugging Face. This is required for models that use custom code in their configuration files. Set to true when loading models that require custom model implementations.
Default: true
engine_args.engine_output_type¶
Specifies the type of output produced by this stage's engine. This determines what kind of data flows to downstream stages. Possible values include latent (hidden states), text (tokenized text), and audio (audio waveforms). When set to latent, the stage outputs latent hidden states in addition to token IDs, which are consumed by downstream stages.
Default: latent
engine_args.enable_prefix_caching¶
Whether to enable prefix caching for this stage. Prefix caching can improve performance by caching KV cache for common prompt prefixes. However, for requests that output hidden states (when engine_output_type is latent), prefix caching is not currently supported and should be set to false.
Default: false
is_comprehension¶
Whether this stage is a text or multimodal comprehension module. When set to true, the stage acts as a comprehension module that processes input text or multimodal content. If this is the first comprehension stage, AsyncOmni will use its tokenizer as the default tokenizer for the entire pipeline.
Default: true
final_output¶
Whether this stage produces output that is part of the final outputs returned to the user. When set to false, the stage only works as an intermediate stage, processing data that flows to downstream stages but not contributing directly to the final response.
Default: true
final_output_type¶
The type of final output produced by this stage. This specifies what format the output will be in when returned to the user. Currently supported values are text (for text generation) and audio (for audio generation).
Default: text
default_sampling_params¶
Default sampling parameters for this stage. These parameters control the generation behavior and align with vLLM's sampling parameter semantics. These defaults are used when no explicit sampling parameters are provided in the request.
default_sampling_params.temperature¶
Sampling temperature for controlling randomness. Lower values (e.g., 0.0) make the output more deterministic and focused, while higher values increase randomness.
Default: 0.0
default_sampling_params.top_p¶
Nucleus sampling parameter. Only tokens with cumulative probability mass up to top_p are considered. This helps filter out low-probability tokens while maintaining diversity.
Default: 1.0
default_sampling_params.top_k¶
Top-k sampling parameter. Only the top k most likely tokens are considered. Set to -1 to disable top-k filtering and consider all tokens.
Default: -1
default_sampling_params.max_tokens¶
Maximum number of tokens to generate in this stage. This limits the length of the output sequence.
Default: 2048
default_sampling_params.seed¶
Random seed for reproducible generation. When set, the random number generator will be initialized with this seed to ensure consistent outputs across runs.
Default: 42
default_sampling_params.detokenize¶
Whether to detokenize the output tokens into text. When set to true, token IDs are converted back to readable text strings.
Default: True
default_sampling_params.repetition_penalty¶
Penalty applied to tokens that have already appeared in the generated sequence. Values greater than 1.0 discourage repetition, while values less than 1.0 encourage it. A value of 1.0 applies no penalty.
Default: 1.1
tts_args (TTS stages only)¶
Configuration for Text-to-Speech specific parameters. This section is only applicable to TTS model stages (e.g., qwen3_tts).
tts_args.max_instructions_length¶
Maximum character length for voice style/emotion instructions. Instructions exceeding this limit will be rejected with a validation error.
Default: 500
This value can be overridden at runtime using the --tts-max-instructions-length CLI parameter when starting the server.