vllm-omni serve

Stage-based CLI quickstart

The stage-based CLI is for deployments that launch each pipeline stage in its own operating system process, for example on separate GPUs or on separate hosts.

  • For migrated models that use the bundled deployment YAMLs in vllm_omni/deploy/, --deploy-config is needed only to override the defaults: vllm serve MODEL --omni ... loads the bundled deployment configuration automatically.
  • For legacy models whose configuration files live in vllm_omni/model_executor/stage_configs/, --stage-configs-path is still required.

The commands below show a common device mapping in which Stage 0 uses GPU 0 and the worker stage uses GPU 1, selected via CUDA_VISIBLE_DEVICES.

Example: Initializing Stage 0 (Orchestrator and API Server):

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --port 8091 \
    --stage-id 0 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

Example: Initializing a Headless Worker Stage (Stage 1):

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

To use a custom deployment YAML in the new schema, append --deploy-config /path/to/override.yaml to each command. For legacy models, use --stage-configs-path /path/to/stage_configs.yaml instead.
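The two launch commands above differ only in the visible GPU, the stage id, and whether the process serves the API. The following is a minimal Python sketch that assembles such commands as argument lists. The helper name build_stage_command is hypothetical and not part of vLLM-Omni; it uses only flags documented on this page, and the GPU-per-stage mapping simply mirrors the example.

```python
import shlex

MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"


def build_stage_command(stage_id, master_address="127.0.0.1", master_port=26000,
                        api_port=8091, deploy_config=None):
    """Return the argv list for one stage process (hypothetical helper)."""
    cmd = ["vllm", "serve", MODEL, "--omni", "--stage-id", str(stage_id)]
    if stage_id == 0:
        # Stage 0 hosts the orchestrator and the API server.
        cmd += ["--port", str(api_port)]
    else:
        # Every other stage runs as a headless worker.
        cmd.append("--headless")
    cmd += ["--omni-master-address", master_address,
            "--omni-master-port", str(master_port)]
    if deploy_config is not None:
        # Optional override YAML in the new deploy schema.
        cmd += ["--deploy-config", deploy_config]
    return cmd


for stage_id in (0, 1):
    # Print a copy-pasteable command line; nothing is launched here.
    print(f"CUDA_VISIBLE_DEVICES={stage_id}", shlex.join(build_stage_command(stage_id)))
```

Building the command as a list keeps quoting out of the picture until the final shlex.join, which is convenient when the same script has to start stages on several hosts.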

In a standard single-command launch, --stage-overrides applies stage-specific settings from one CLI command. With the stage-based CLI, each process runs exactly one stage, so it is simpler to pass tuning parameters as ordinary command-line flags for that stage than to build a composite --stage-overrides JSON string.

For example, instead of the following composite configuration:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'

the stage-based CLI lets you launch Stage 1 with explicit flags:

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --gpu-memory-utilization 0.5 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000
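The correspondence between an override key and a per-stage flag can be sketched in Python. This is an illustration under a stated assumption: each key maps to a flag by replacing underscores with hyphens, as gpu_memory_utilization and --gpu-memory-utilization do in the example above. The helper overrides_to_flags is hypothetical, and other keys should be checked against vllm serve --help.

```python
import json


def overrides_to_flags(stage_overrides_json, stage_id):
    """Translate one stage's entry of a --stage-overrides JSON into CLI flags.

    Assumption: a key maps to a flag by replacing '_' with '-', as
    gpu_memory_utilization -> --gpu-memory-utilization does above.
    """
    overrides = json.loads(stage_overrides_json).get(str(stage_id), {})
    flags = []
    for key, value in overrides.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            flags.append(flag)  # boolean switch, passed without a value
        elif value is False:
            continue  # assumption: omitting a switch leaves it disabled
        else:
            flags += [flag, str(value)]
    return flags


print(overrides_to_flags('{"1": {"gpu_memory_utilization": 0.5}}', 1))
```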

JSON CLI Arguments

Arguments

OmniConfig

Configuration for vLLM-Omni multi-stage and diffusion models.

--omni

Enable vLLM-Omni mode for multi-modal and diffusion models

Default: False

--enable-sleep-mode

Enable GPU memory pool for sleep mode.

Default: False

--task-type

Possible choices: CustomVoice, VoiceDesign, Base

Default task type for TTS models (CustomVoice, VoiceDesign, or Base). If not specified, it is inferred from the model path.

Default: None

--stage-configs-path

[Deprecated — will be removed in a future release] Path to a legacy stage configs YAML (stage_args format). Prefer --deploy-config for new-format deploy YAMLs.

Default: None

--deploy-config

Path to a deploy config YAML (new format with stages/engine_args). Mutually exclusive with --stage-configs-path.

Default: None

--stage-overrides

Per-stage JSON overrides. Example: '{"0": {"gpu_memory_utilization": 0.8}, "2": {"enforce_eager": true}}'

Default: None
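Hand-written JSON on a command line is easy to break with a stray quote. One way to avoid that, sketched here in Python, is to build the overrides as a dictionary and let json.dumps and shlex.quote produce the argument. The dictionary reuses the values from the help text example; nothing else is assumed about which keys a stage accepts.

```python
import json
import shlex

# Keys are stage ids as strings; values are per-stage settings.
overrides = {
    "0": {"gpu_memory_utilization": 0.8},
    "2": {"enforce_eager": True},
}

# json.dumps emits valid JSON (note the lowercase `true`).
arg = json.dumps(overrides)

# shlex.quote wraps the string so a POSIX shell passes it through unchanged.
print("--stage-overrides", shlex.quote(arg))
```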

--async-chunk, --no-async-chunk

Override the deploy YAML's async_chunk: bool. Unset leaves the YAML value in force.

Default: None

--stage-id

Select and launch a single stage by stage_id.

Default: None

--replica-id

Deprecated and ignored — replica ids are auto-assigned by the master server. Specifying this flag prints a warning and has no effect.

Default: None

--stage-init-timeout

The timeout for initializing a single stage in seconds (default: 300)

Default: 300

--init-timeout

The timeout for initializing the stages.

Default: 600

--shm-threshold-bytes

The threshold for the shared memory size.

Default: 65536

--log-stats

Enable logging the stats.

Default: False

--log-file

The path to the log file.

Default: None

--batch-timeout

The timeout for the batch.

Default: 10

--worker-backend

Possible choices: multi_process, ray

The backend to use for stage workers.

Default: multi_process

--ray-address

The address of the Ray cluster to connect to.

Default: None

--omni-master-address, -oma

Hostname or IP address of the Omni orchestrator (master).

Default: None

--omni-master-port, -omp

Port of the Omni orchestrator (master).

Default: None

--omni-replica-address, -ora

Local bind address (this host's IP) that the headless stage advertises to the Omni master for its handshake/input/output ZMQ sockets. If unset, auto-detected via a UDP-connect routing probe against --omni-master-address. Override only when the auto-detected IP is wrong (e.g. multi-NIC host where the master is reachable on the wrong interface).

Default: None
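A UDP-connect routing probe is a common way to find out which local address the operating system would use to reach a given peer. The Python sketch below shows the general technique only; it is not taken from the vLLM-Omni source, it handles IPv4 only, and the function name probe_local_ip and the default port are illustrative assumptions.

```python
import socket


def probe_local_ip(master_address, master_port=26000):
    """Return the local IPv4 address the OS would use to reach master_address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket sends no packets; it only makes the
        # kernel pick a route and bind a matching source address.
        sock.connect((master_address, master_port))
        return sock.getsockname()[0]


print(probe_local_ip("127.0.0.1"))
```

On a multi-NIC host the returned address follows the routing table, which is why the result can point at an interface the master cannot actually reach and why an explicit --omni-replica-address is sometimes needed.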

--omni-dp-size-local

Number of stage replicas this runtime launches locally for its own --stage-id. Process-local: head and every headless invocation read their own copy; values may differ across invocations. Requires --stage-id to be set when not equal to 1.

Default: 1

--omni-lb-policy

Possible choices: random, round-robin, least-queue-length

Per-stage load-balancing policy used by the head's StagePool to route requests across UP replicas. Only consulted on the head runtime.

Default: random
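The three policy names follow conventional load-balancing terminology. The Python sketch below illustrates what those names usually mean; it is a generic illustration, not the StagePool implementation, and the replica tuples and the make_picker helper are hypothetical.

```python
import itertools
import random


def make_picker(policy):
    """Return a function that picks one replica name from (name, queue_len) pairs."""
    if policy == "random":
        return lambda replicas: random.choice(replicas)[0]
    if policy == "round-robin":
        counter = itertools.count()
        # Cycle through replicas in order, one per call.
        return lambda replicas: replicas[next(counter) % len(replicas)][0]
    if policy == "least-queue-length":
        # Prefer the replica with the fewest queued requests.
        return lambda replicas: min(replicas, key=lambda r: r[1])[0]
    raise ValueError(f"unknown policy: {policy}")


replicas = [("replica-a", 3), ("replica-b", 0), ("replica-c", 5)]
pick = make_picker("round-robin")
print([pick(replicas) for _ in range(4)])
print(make_picker("least-queue-length")(replicas))
```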

--omni-heartbeat-timeout

Seconds before an unreporting replica is marked ERROR in the OmniCoordinator. Only consulted on the head runtime.

Default: 30.0

--num-gpus

Number of GPUs to use for diffusion model inference.

Default: None

--model-class-name

Override the diffusion pipeline class name (e.g. LTX2ImageToVideoPipeline).

Default: None

--diffusion-load-format

Possible choices: default, custom_pipeline, dummy, diffusers

How to load the diffusion pipeline: native/registry (default), custom_pipeline, dummy, or diffusers for the HF diffusers adapter.

Default: None

--diffusers-load-kwargs

JSON object passed to DiffusionPipeline.from_pretrained(). It overrides the corresponding parameters in the standard vLLM-Omni interface (e.g. '{"use_safetensors": true, "variant": "fp16"}').

Default: {}

--diffusers-call-kwargs

JSON object passed to pipeline.__call__(). Useful for model-specific sampling parameters not covered by the vLLM-Omni interface. At request time, it is overridden by the corresponding parameters in the vLLM-Omni interface (e.g. '{"num_inference_steps": 30, "guidance_scale": 7.5}').

Default: {}

--usp, --ulysses-degree

Ulysses Sequence Parallelism degree for diffusion models. Equivalent to setting DiffusionParallelConfig.ulysses_degree.

Default: None

--ulysses-mode

Possible choices: strict, advanced_uaa

Ulysses sequence-parallel mode for diffusion models. 'strict' keeps the original divisibility requirements; 'advanced_uaa' enables the experimental UAA path for uneven sequence/head shapes.

Default: strict

--ring, --ring-degree

Ring Sequence Parallelism degree for diffusion models. Equivalent to setting DiffusionParallelConfig.ring_degree.

Default: None

--diffusion-quantization-config

JSON string for diffusion quantization_config. Example: '{"method":"gguf","gguf_model":"/path/to/model.gguf"}'.

Default: None

--force-cutlass-fp8

Diffusion-only runtime override for ModelOpt FP8 checkpoints: force CUTLASS FP8 linear kernels on CUDA SM89+ devices. Ignored for BF16, non-ModelOpt FP8, ROCm, and older CUDA GPUs.

Default: None

--use-hsdp

Enable HSDP (Hybrid Sharded Data Parallel) for diffusion models. Shards model weights across GPUs to reduce per-GPU memory usage.

Default: False

--hsdp-shard-size

Number of GPUs to shard weights across. -1 = auto (world_size / replicate_size).

Default: -1

--hsdp-replicate-size

Number of replica groups for HSDP. Each group holds a full sharded copy.

Default: 1
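The automatic value of --hsdp-shard-size is described as world_size / replicate_size. A minimal Python sketch of that arithmetic follows. The helper name is hypothetical, and the divisibility check is an assumption about how an uneven split would be treated, not documented behavior.

```python
def resolve_hsdp_shard_size(world_size, replicate_size=1, shard_size=-1):
    """Resolve the shard size, treating -1 as 'derive from world size'."""
    if shard_size != -1:
        return shard_size
    if world_size % replicate_size != 0:
        # Assumption: replica groups must divide the GPUs evenly.
        raise ValueError("world_size must be a multiple of replicate_size")
    return world_size // replicate_size


# 8 GPUs split into 2 replica groups gives 4 GPUs per sharded copy.
print(resolve_hsdp_shard_size(world_size=8, replicate_size=2))
```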

--diffusion-attention-backend

Diffusion attention backend (shorthand). Sets the default backend for all diffusion attention roles, e.g. 'FLASH_ATTN'. May be combined with --diffusion-attention-config.per_role.* overrides, but mutually exclusive with --diffusion-attention-config.default.backend.

Default: None

--diffusion-attention-config, -dac

Diffusion attention config. Accepts JSON or vLLM-style dotted flags. Examples: --diffusion-attention-config.default.backend FLASH_ATTN, --diffusion-attention-config.per_role.self.backend SPARSE_BLOCK, --diffusion-attention-config.per_role.cross.backend SAGE_ATTN, --diffusion-attention-config '{"default": {"backend": "FLASH_ATTN"}, "per_role": {"cross": {"backend": "SAGE_ATTN"}}}'.

Default: None
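The help text gives the same configuration in two spellings, a JSON object and dotted flags. The Python sketch below flattens a nested dictionary into dotted flag/value pairs to make the correspondence explicit. The helper to_dotted_flags is hypothetical, and the sketch assumes the two spellings are interchangeable, as the examples suggest.

```python
def to_dotted_flags(prefix, config):
    """Flatten a nested dict into (flag, value) pairs using dotted paths."""
    pairs = []
    for key, value in config.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as per_role.cross.
            pairs += to_dotted_flags(path, value)
        else:
            pairs.append((path, str(value)))
    return pairs


config = {
    "default": {"backend": "FLASH_ATTN"},
    "per_role": {"cross": {"backend": "SAGE_ATTN"}},
}
for flag, value in to_dotted_flags("--diffusion-attention-config", config):
    print(flag, value)
```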

--cache-backend

Cache backend for diffusion models, options: 'tea_cache', 'cache_dit', 'mag_cache'

Default: none

--cache-config

JSON string of cache configuration. TeaCache: '{"rel_l1_thresh": 0.2}'. MagCache: '{"mag_threshold": 0.24, "mag_max_skip_steps": 5, "mag_retention_ratio": 0.1}'. Calibration mode: add '"mag_calibrate": true'

Default: None
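To make the calibration note concrete, the Python sketch below starts from the MagCache example string and adds the mag_calibrate key before re-serializing it. It assumes the calibration switch is a top-level key of the same JSON object, which is how the help text reads but is not confirmed here.

```python
import json
import shlex

# MagCache example taken from the help text.
mag_config = json.loads(
    '{"mag_threshold": 0.24, "mag_max_skip_steps": 5, "mag_retention_ratio": 0.1}'
)

# Assumption: calibration is enabled by a top-level key in the same object.
mag_config["mag_calibrate"] = True

cache_config_arg = json.dumps(mag_config)
print("--cache-backend", "mag_cache", "--cache-config", shlex.quote(cache_config_arg))
```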

--enable-cache-dit-summary

Enable cache-dit summary logging after diffusion forward passes.

Default: False

--step-execution

Enable per-step diffusion execution so running requests can be aborted between denoise steps.

Default: False

--vae-use-slicing

Enable VAE slicing for memory optimization (useful for mitigating OOM issues).

Default: False

--vae-use-tiling

Enable VAE tiling for memory optimization (useful for mitigating OOM issues).

Default: False

--disable-multithread-weight-load

Disable multi-threaded safetensors loading. Multi-threaded loading is enabled by default with 4 threads.

Default: True (multi-threaded loading enabled)

--num-weight-load-threads

Number of threads for parallel weight loading (default: 4).

Default: 4

--enable-cpu-offload

Enable CPU offloading for diffusion models.

Default: False

--enable-layerwise-offload

Enable layerwise (blockwise) offloading on DiT modules.

Default: False

--boundary-ratio

Boundary split ratio for low/high DiT in video models (e.g., 0.875 for Wan2.2).

Default: None

--flow-shift

Scheduler flow_shift for video models (e.g., 5.0 for 720p, 12.0 for 480p).

Default: None

--diffusion-kv-cache-dtype

Diffusion attention KV cache dtype (e.g. fp8). Separate from vLLM --kv-cache-dtype.

Default: None

--diffusion-kv-cache-skip-steps

Diffusion KV-cache quantization skip-step selector, e.g. '0-9,20,25-30'.

Default: None

--diffusion-kv-cache-skip-layers

Diffusion KV-cache quantization skip-layer selector, e.g. '0,1,4-8'.

Default: None
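The skip-step and skip-layer selectors share one comma-and-range syntax. The Python sketch below expands such a selector into explicit indices. It assumes that ranges are inclusive on both ends, which the examples suggest but do not state, and parse_selector is a hypothetical helper rather than the parser vLLM-Omni uses.

```python
def parse_selector(selector):
    """Expand a selector such as '0-9,20,25-30' into a sorted list of indices."""
    indices = set()
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            # Assumption: both endpoints are included.
            indices.update(range(start, end + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


print(parse_selector("0,1,4-8"))
```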

--cfg-parallel-size

Possible choices: 1, 2

Number of devices for CFG parallel computation for diffusion models. Equivalent to setting DiffusionParallelConfig.cfg_parallel_size.

Default: 1

--vae-patch-parallel-size

VAE Patch Parallelism degree for diffusion models. Distributes VAE decode workload across multiple ranks by splitting the latent spatially. Equivalent to setting DiffusionParallelConfig.vae_patch_parallel_size.

Default: 1

--default-sampling-params

JSON string of default sampling parameters. Structure: {"<stage_id>": {<sampling_param>: value, ...}, ...}, e.g. '{"0": {"num_inference_steps":50, "guidance_scale":1}}'. Currently only supported for diffusion models.

Default: None

--max-generated-image-size

Maximum size of a generated image (height * width).

Default: None

--tts-max-instructions-length

Maximum length for TTS voice style instructions (overrides stage config, default: 500).

Default: None

--no-guardrails

Disable Cosmos3 text/video safety guardrails for this server.

Default: False

--enable-diffusion-pipeline-profiler

Enable diffusion pipeline profiler to display stage durations.

Default: False

--enable-ar-profiler

Enable AR stage profiler to include AR stage timing in stage_durations.

Default: False

--auxiliary-text-encoder

Model name or path of the auxiliary text encoder (especially for HiDream-I1-Full).

Default: None