vllm-omni serve¶

Stage-based CLI quickstart¶

The stage-based CLI is for deployments that launch each pipeline stage in its own process, for example on separate GPUs or on different hosts.

For migrated models that use the bundled deployment YAML configurations in vllm_omni/deploy/, running vllm serve MODEL --omni ... loads the bundled deployment configuration automatically; pass --deploy-config only to override it.
For legacy models whose configuration files live in vllm_omni/model_executor/stage_configs/, --stage-configs-path is still required.

Example: Initializing Stage 0 (Orchestrator and API Server). The commands below show a common device mapping, set via CUDA_VISIBLE_DEVICES, in which Stage 0 runs on GPU 0 and the worker stage runs on GPU 1.

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --port 8091 \
    --stage-id 0 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

Example: Initializing a Headless Worker Stage (Stage 1):

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

To use a custom deployment YAML in the new schema, append --deploy-config /path/to/override.yaml to each command. For legacy models, use --stage-configs-path /path/to/stage_configs.yaml instead.

In a standard single-command launch, --stage-overrides applies stage-specific settings from one CLI invocation. With the stage-based CLI, each process runs exactly one stage, so the recommended approach is to pass tuning parameters as ordinary command-line flags on that stage's command rather than building a composite --stage-overrides JSON string.

For example, instead of the following composite configuration:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'

the stage-based CLI lets you launch Stage 1 with the parameter set explicitly:

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --gpu-memory-utilization 0.5 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000
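The two launch commands above differ only in a few flags, so they can be generated from one place. The following is a minimal dry-run sketch, not part of vLLM-Omni: `stage_cmd` is a hypothetical helper that only prints the command for a stage, and the GPU mapping, port 8091, and master address/port are the example values used above.

```shell
# Dry-run sketch: print per-stage launch commands instead of executing them.
# stage_cmd is a hypothetical helper; all flag values are the example values above.
MODEL="Qwen/Qwen3-Omni-30B-A3B-Instruct"
MASTER_ADDR="127.0.0.1"
MASTER_PORT=26000

# stage_cmd STAGE_ID GPU_ID
stage_cmd() {
  stage_id="$1"
  gpu_id="$2"
  cmd="vllm serve $MODEL --omni --stage-id $stage_id"
  cmd="$cmd --omni-master-address $MASTER_ADDR --omni-master-port $MASTER_PORT"
  if [ "$stage_id" -eq 0 ]; then
    cmd="$cmd --port 8091"   # Stage 0 is the orchestrator and API server
  else
    cmd="$cmd --headless"    # other stages run as headless workers
  fi
  echo "CUDA_VISIBLE_DEVICES=$gpu_id $cmd"
}

stage_cmd 0 0
stage_cmd 1 1
```

Printing the commands first makes it easy to check that every stage points at the same master address and port before anything is started.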

JSON CLI Arguments¶

Arguments¶

OmniConfig¶

Configuration for vLLM-Omni multi-stage and diffusion models.

`--omni`¶

Enable vLLM-Omni mode for multi-modal and diffusion models.

Default: False

`--enable-sleep-mode`¶

Enable GPU memory pool for sleep mode.

Default: False

`--task-type`¶

Possible choices: CustomVoice, VoiceDesign, Base

Default task type for TTS models (CustomVoice, VoiceDesign, or Base). If not specified, will be inferred from model path.

Default: None

`--forced-aligner`¶

Enable streaming TTS word timestamps via a forced aligner. Pass the aligner model path/name, e.g. 'Qwen/Qwen3-ForcedAligner-0.6B'. Disabled when omitted.

Default: None

`--forced-aligner-config`¶

Optional YAML file for forced aligner settings (model, runner, gpu_memory_utilization, dtype, max_model_len). The --forced-aligner flag, when set, overrides the YAML model field.

Default: None

`--stage-configs-path`¶

[Deprecated — will be removed in a future release] Path to a legacy stage configs YAML (stage_args format). Prefer --deploy-config for new-format deploy YAMLs.

Default: None

`--deploy-config`¶

Path to a deploy config YAML (new format with stages/engine_args). Mutually exclusive with --stage-configs-path.

Default: None

`--stage-overrides`¶

Per-stage JSON overrides. Example: '{"0": {"gpu_memory_utilization": 0.8}, "2": {"enforce_eager": true}}'

Default: None
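Because the override value is JSON inside a shell argument, quoting is the usual source of mistakes. The sketch below builds the string with `printf` and prints the resulting command without running it; the variable names `STAGE`, `GPU_MEM`, and `OVERRIDES` are illustrative only, and `MODEL` stands in for a real model name.

```shell
# Sketch: assemble a --stage-overrides JSON string without hand-escaping quotes.
# STAGE, GPU_MEM and OVERRIDES are illustrative variable names.
STAGE=1
GPU_MEM=0.5

# The single-quoted format string keeps the JSON double quotes literal.
OVERRIDES=$(printf '{"%s": {"gpu_memory_utilization": %s}}' "$STAGE" "$GPU_MEM")

# Dry run: show the command that would be executed.
echo "vllm serve MODEL --omni --stage-overrides '$OVERRIDES'"
```

When the command is actually run, the JSON should be passed as one single-quoted argument so the shell does not strip the inner double quotes.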

`--async-chunk`, `--no-async-chunk`¶

Override the deploy YAML's async_chunk: bool. Unset leaves the YAML value in force.

Default: None

`--stage-id`¶

Select and launch a single stage by stage_id.

Default: None

`--replica-id`¶

Deprecated and ignored — replica ids are auto-assigned by the master server. Specifying this flag prints a warning and has no effect.

Default: None

`--stage-init-timeout`¶

Timeout, in seconds, for initializing a single stage.

Default: 300

`--init-timeout`¶

The timeout for initializing the stages.

Default: 600

`--shm-threshold-bytes`¶

The shared-memory size threshold, in bytes.

Default: 65536

`--log-stats`¶

Enable logging the stats.

Default: False

`--log-file`¶

The path to the log file.

Default: None

`--batch-timeout`¶

The timeout for the batch.

Default: 10

`--worker-backend`¶

Possible choices: multi_process, ray

The backend to use for stage workers.

Default: multi_process

`--ray-address`¶

The address of the Ray cluster to connect to.

Default: None

`--omni-master-address`, `-oma`¶

Hostname or IP address of the Omni orchestrator (master).

Default: None

`--omni-master-port`, `-omp`¶

Port of the Omni orchestrator (master).

Default: None

`--omni-replica-address`, `-ora`¶

Local bind address (this host's IP) that the headless stage advertises to the Omni master for its handshake/input/output ZMQ sockets. If unset, auto-detected via a UDP-connect routing probe against --omni-master-address. Override only when the auto-detected IP is wrong (e.g. multi-NIC host where the master is reachable on the wrong interface).

Default: None

`--omni-dp-size-local`¶

Number of stage replicas this runtime launches locally for its own --stage-id. Process-local: head and every headless invocation read their own copy; values may differ across invocations. Requires --stage-id to be set when not equal to 1.

Default: 1

`--omni-lb-policy`¶

Possible choices: random, round-robin, least-queue-length

Per-stage load-balancing policy used by the head's StagePool to route requests across UP replicas. Only consulted on the head runtime.

Default: random

`--omni-heartbeat-timeout`¶

Seconds before an unreporting replica is marked ERROR in the OmniCoordinator. Only consulted on the head runtime.

Default: 30.0

`--num-gpus`¶

Number of GPUs to use for diffusion model inference.

Default: None

`--model-class-name`¶

Override the diffusion pipeline class name (e.g. LTX2ImageToVideoPipeline).

Default: None

`--diffusion-load-format`¶

Possible choices: default, custom_pipeline, dummy, diffusers

How to load the diffusion pipeline: native/registry (default), custom_pipeline, dummy, or diffusers for the HF diffusers adapter.

Default: None

`--diffusers-load-kwargs`¶

JSON object passed to DiffusionPipeline.from_pretrained(). It overrides the corresponding parameters in the standard vLLM-Omni interface (e.g. '{"use_safetensors": true, "variant": "fp16"}').

Default: {}

`--diffusers-call-kwargs`¶

JSON object passed to pipeline.call(). Useful for model-specific sampling parameters not covered by the vLLM-Omni interface. At request time, it is overridden by the corresponding parameters in the vLLM-Omni interface (e.g. '{"num_inference_steps": 30, "guidance_scale": 7.5}').

Default: {}

`--usp`, `--ulysses-degree`¶

Ulysses Sequence Parallelism degree for diffusion models. Equivalent to setting DiffusionParallelConfig.ulysses_degree.

Default: None

`--ulysses-mode`¶

Possible choices: strict, advanced_uaa

Ulysses sequence-parallel mode for diffusion models. 'strict' keeps the original divisibility requirements; 'advanced_uaa' enables the experimental UAA path for uneven sequence/head shapes.

Default: strict

`--ring`, `--ring-degree`¶

Ring Sequence Parallelism degree for diffusion models. Equivalent to setting DiffusionParallelConfig.ring_degree.

Default: None

`--diffusion-quantization-config`¶

JSON string for diffusion quantization_config. Example: '{"method":"gguf","gguf_model":"/path/to/model.gguf"}'.

Default: None

`--force-cutlass-fp8`¶

Diffusion-only runtime override for ModelOpt FP8 checkpoints: force CUTLASS FP8 linear kernels on CUDA SM89+ devices. Ignored for BF16, non-ModelOpt FP8, ROCm, and older CUDA GPUs.

Default: None

`--use-hsdp`¶

Enable HSDP (Hybrid Sharded Data Parallel) for diffusion models. Shards model weights across GPUs to reduce per-GPU memory usage.

Default: False

`--hsdp-shard-size`¶

Number of GPUs to shard weights across. -1 = auto (world_size / replicate_size).

Default: -1

`--hsdp-replicate-size`¶

Number of replica groups for HSDP. Each group holds a full sharded copy.

Default: 1
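The auto rule for --hsdp-shard-size stated above (world_size / replicate_size) can be worked through with plain shell arithmetic. This is only an illustration of that formula: the world size of 8 GPUs and the two replica groups are assumed values, and the variable names are not part of vLLM-Omni.

```shell
# Illustration of the auto rule: shard_size = world_size / replicate_size.
# WORLD_SIZE=8 and REPLICATE_SIZE=2 are assumed example values.
WORLD_SIZE=8
REPLICATE_SIZE=2
SHARD_SIZE=-1   # -1 means "auto", as documented above

if [ "$SHARD_SIZE" -eq -1 ]; then
  SHARD_SIZE=$((WORLD_SIZE / REPLICATE_SIZE))
fi

# Dry run: print the equivalent explicit flags.
echo "--use-hsdp --hsdp-replicate-size $REPLICATE_SIZE --hsdp-shard-size $SHARD_SIZE"
```

With these values each of the two replica groups shards the weights across 4 GPUs.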

`--diffusion-attention-backend`¶

Diffusion attention backend (shorthand). Sets the default backend for all diffusion attention roles, e.g. 'FLASH_ATTN'. May be combined with --diffusion-attention-config.per_role.* overrides, but mutually exclusive with --diffusion-attention-config.default.backend.

Default: None

`--diffusion-attention-config`, `-dac`¶

Diffusion attention config. Accepts JSON or vLLM-style dotted flags. Examples: --diffusion-attention-config.default.backend FLASH_ATTN, --diffusion-attention-config.per_role.self.backend SPARSE_BLOCK, --diffusion-attention-config.per_role.cross.backend SAGE_ATTN, --diffusion-attention-config '{"default": {"backend": "FLASH_ATTN"}, "per_role": {"cross": {"backend": "SAGE_ATTN"}}}'.

Default: None

`--cache-backend`¶

Cache backend for diffusion models. Options: 'tea_cache', 'cache_dit', 'mag_cache', 'step_cache'.

Default: none

`--cache-config`¶

JSON string of cache configuration. TeaCache: '{"rel_l1_thresh": 0.2}'. MagCache: '{"mag_threshold": 0.24, "mag_max_skip_steps": 5, "mag_retention_ratio": 0.1}'. Calibration mode: add '"mag_calibrate": true'

Default: None

`--enable-cache-dit-summary`¶

Enable cache-dit summary logging after diffusion forward passes.

Default: False

`--step-execution`¶

Enable per-step diffusion execution so running requests can be aborted between denoise steps.

Default: False

`--request-batch-max-wait-ms`¶

Request-mode batch admission: max milliseconds to wait for compatible requests to accumulate before scheduling a fused forward wave. 0 disables admission (default).

Default: 0.0

`--vae-use-slicing`¶

Enable VAE slicing for memory optimization (useful for mitigating OOM issues).

Default: False

`--vae-use-tiling`¶

Enable VAE tiling for memory optimization (useful for mitigating OOM issues).

Default: False

`--disable-multithread-weight-load`¶

Disable multi-threaded safetensors loading (default: enabled with 4 threads).

Default: True

`--num-weight-load-threads`¶

Number of threads for parallel weight loading (default: 4).

Default: 4

`--enable-cpu-offload`¶

Enable CPU offloading for diffusion models.

Default: False

`--enable-layerwise-offload`¶

Enable layerwise (blockwise) offloading on DiT modules.

Default: False

`--boundary-ratio`¶

Boundary split ratio for low/high DiT in video models (e.g., 0.875 for Wan2.2).

Default: None

`--flow-shift`¶

Scheduler flow_shift for video models (e.g., 5.0 for 720p, 12.0 for 480p).

Default: None

`--diffusion-kv-cache-dtype`¶

Diffusion attention KV cache dtype (e.g. fp8). Separate from vLLM --kv-cache-dtype.

Default: None

`--diffusion-kv-cache-skip-steps`¶

Diffusion KV-cache quantization skip-step selector, e.g. '0-9,20,25-30'.

Default: None

`--diffusion-kv-cache-skip-layers`¶

Diffusion KV-cache quantization skip-layer selector, e.g. '0,1,4-8'.

Default: None
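The skip-step and skip-layer selectors share one comma-and-range syntax. The sketch below expands such a selector into explicit indices to preview what it covers. It assumes that ranges are inclusive on both ends, which the text above does not state, and `expand_selector` is a hypothetical helper rather than a vLLM-Omni function.

```shell
# Sketch: expand a selector such as '0,1,4-8' into explicit indices.
# Assumption: ranges are inclusive. expand_selector is a hypothetical helper.
expand_selector() {
  out=""
  old_ifs=$IFS
  IFS=','
  for part in $1; do
    case "$part" in
      *-*)
        start=${part%-*}   # text before the dash
        end=${part#*-}     # text after the dash
        i=$start
        while [ "$i" -le "$end" ]; do
          out="$out $i"
          i=$((i + 1))
        done
        ;;
      *)
        out="$out $part"
        ;;
    esac
  done
  IFS=$old_ifs
  echo "${out# }"
}

expand_selector '0,1,4-8'   # prints: 0 1 4 5 6 7 8
```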

`--cfg-parallel-size`¶

Possible choices: 1, 2

Number of devices for CFG parallel computation for diffusion models. Equivalent to setting DiffusionParallelConfig.cfg_parallel_size.

Default: 1

`--vae-patch-parallel-size`¶

VAE Patch Parallelism degree for diffusion models. Distributes VAE decode workload across multiple ranks by splitting the latent spatially. Equivalent to setting DiffusionParallelConfig.vae_patch_parallel_size.

Default: 1

`--vae-parallel-mode`¶

Possible choices: tile, spatial_shard_height, spatial_shard_width

VAE parallel decode strategy for diffusion models. 'tile' (default) uses patch/tile parallel decode; 'spatial_shard_height'/'spatial_shard_width' use spatially-sharded decode that splits decoder feature maps along height/width and exchanges halo regions. The 'spatial_shard_*' modes require vae_patch_parallel_size to match the DiT group size. Equivalent to setting DiffusionParallelConfig.vae_parallel_mode.

Default: tile

`--default-sampling-params`¶

JSON string of default sampling parameters. Structure: {"<stage_id>": {"<param>": value, ...}, ...}, e.g., '{"0": {"num_inference_steps":50, "guidance_scale":1}}'. Currently only supports diffusion models.

Default: None

`--max-generated-image-size`¶

Maximum generated image size in pixels (height * width).

Default: 33177600
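The default limit equals 7680 * 4320 pixels. A quick pre-check of a requested resolution against the height * width budget could look like the sketch below; `fits_pixel_budget` is a hypothetical helper, and treating the limit as inclusive is an assumption.

```shell
# Sketch: compare height * width against the default pixel budget.
# fits_pixel_budget is a hypothetical helper; the inclusive comparison is assumed.
MAX_PIXELS=33177600

# fits_pixel_budget HEIGHT WIDTH
fits_pixel_budget() {
  [ $(( $1 * $2 )) -le "$MAX_PIXELS" ]
}

if fits_pixel_budget 4320 7680; then
  echo "4320x7680 is within the limit"
fi
if ! fits_pixel_budget 8192 8192; then
  echo "8192x8192 exceeds the limit"
fi
```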

`--diffusion-streaming-output`¶

Enable chunked streaming output for diffusion (mainly video generation) models that support it.

Default: False

`--tts-max-instructions-length`¶

Maximum length for TTS voice style instructions (overrides stage config, default: 500).

Default: None

`--no-guardrails`¶

Disable Cosmos3 text/video safety guardrails for this server.

Default: False

`--enable-diffusion-pipeline-profiler`¶

Enable diffusion pipeline profiler to display stage durations.

Default: False

`--enable-ar-profiler`¶

Enable AR stage profiler to include AR stage timing in stage_durations.

Default: False

`--enable-orch-monitor`¶

Enable orchestrator window monitor and write a JSON log at shutdown.

Default: False

`--auxiliary-text-encoder`¶

Model name or path of the auxiliary text encoder (used in particular by Hidream-l1-full).

Default: None
