vllm-omni serve¶
Stage-based CLI quickstart¶
The stage-based CLI is designed for deployments that require launching each pipeline stage in an isolated process (e.g., across separate operating system processes, distinct GPUs, or distributed hosts).
- For migrated models that utilize the bundled deployment YAML configurations located in vllm_omni/deploy/, the --deploy-config flag is only required to override the default configuration. By default, executing vllm serve MODEL --omni ... automatically loads the bundled deployment configuration.
- For legacy models utilizing configuration files located in vllm_omni/model_executor/stage_configs/, the --stage-configs-path parameter remains mandatory.
Example: Initializing Stage 0 (Orchestrator and API Server): The commands below show a common device mapping where Stage 0 uses GPU 0 and worker stages use GPU 1 via CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--port 8091 \
--stage-id 0 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
Example: Initializing a Headless Worker Stage (Stage 1):
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
When utilizing a custom deployment YAML based on the new schema, append --deploy-config /path/to/override.yaml to each command execution. Conversely, for legacy models, substitute this parameter with --stage-configs-path /path/to/stage_configs.yaml.
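When several stages have to be launched from a script, it can help to assemble each command as an argument list. The following is a minimal sketch, not part of vLLM-Omni: `build_stage_command` is a hypothetical helper that reproduces the two commands shown above and only prints them rather than running them.

```python
import shlex

MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"


def build_stage_command(stage_id, master_address, master_port,
                        port=None, headless=False, deploy_config=None):
    """Assemble the argv for one stage process (hypothetical helper)."""
    argv = ["vllm", "serve", MODEL, "--omni", "--stage-id", str(stage_id)]
    if port is not None:
        argv += ["--port", str(port)]       # only the API-serving stage needs a port
    if headless:
        argv.append("--headless")           # worker stages run without an API server
    if deploy_config is not None:
        argv += ["--deploy-config", deploy_config]
    argv += ["--omni-master-address", master_address,
             "--omni-master-port", str(master_port)]
    return argv


stage0 = build_stage_command(0, "127.0.0.1", 26000, port=8091)
stage1 = build_stage_command(1, "127.0.0.1", 26000, headless=True)

# Print the commands with the same GPU mapping as the examples above.
for gpu, argv in ((0, stage0), (1, stage1)):
    print(f"CUDA_VISIBLE_DEVICES={gpu} {shlex.join(argv)}")
```

Passing `deploy_config="/path/to/override.yaml"` appends the same `--deploy-config` flag to every stage command, which matches the requirement that the override be given on each invocation.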
When all stages are launched from a single command, the --stage-overrides argument applies stage-specific settings from that one invocation. With the stage-based CLI, each process runs exactly one stage, so it is recommended to pass tuning parameters as ordinary command-line flags on that stage's command rather than constructing a composite --stage-overrides JSON string.
For example, as an alternative to the following composite configuration:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
the stage-based CLI permits the direct initialization of Stage 1 with explicit parameters:
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--gpu-memory-utilization 0.5 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
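The relationship between a --stage-overrides entry and the per-stage flags can be sketched as follows. This is an illustration only: `overrides_to_flags` is a hypothetical helper, and it assumes each override key maps to a flag by replacing underscores with hyphens. That holds for the `gpu_memory_utilization` example above but is not guaranteed for every key.

```python
import json


def overrides_to_flags(stage_overrides_json, stage_id):
    """Turn one stage's entry of a --stage-overrides JSON into CLI flags.

    Assumption: key `some_option` corresponds to flag `--some-option`.
    """
    overrides = json.loads(stage_overrides_json).get(str(stage_id), {})
    flags = []
    for key, value in overrides.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            flags.append(flag)              # boolean switch, no value
        elif value is False:
            continue                        # assumed: a disabled switch is simply omitted
        else:
            flags += [flag, str(value)]
    return flags


flags = overrides_to_flags('{"1": {"gpu_memory_utilization": 0.5}}', 1)
print(flags)
```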
JSON CLI Arguments¶
Arguments¶
OmniConfig¶
Configuration for vLLM-Omni multi-stage and diffusion models.
--omni¶
Enable vLLM-Omni mode for multi-modal and diffusion models
Default: False
--enable-sleep-mode¶
Enable GPU memory pool for sleep mode.
Default: False
--task-type¶
Possible choices: CustomVoice, VoiceDesign, Base
Default task type for TTS models (CustomVoice, VoiceDesign, or Base). If not specified, will be inferred from model path.
Default: None
--stage-configs-path¶
[Deprecated — will be removed in a future release] Path to a legacy stage configs YAML (stage_args format). Prefer --deploy-config for new-format deploy YAMLs.
Default: None
--deploy-config¶
Path to a deploy config YAML (new format with stages/engine_args). Mutually exclusive with --stage-configs-path.
Default: None
--stage-overrides¶
Per-stage JSON overrides. Example: '{"0": {"gpu_memory_utilization": 0.8}, "2": {"enforce_eager": true}}'
Default: None
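Hand-written JSON on a command line is easy to mis-quote. One way to avoid that, sketched here with the Python standard library only, is to build the override mapping as a dictionary and let `json.dumps` and `shlex.quote` produce the argument. The values are the ones from the example above.

```python
import json
import shlex

# Stage ids are JSON object keys, so they are strings.
overrides = {
    "0": {"gpu_memory_utilization": 0.8},
    "2": {"enforce_eager": True},
}

payload = json.dumps(overrides)             # Python True becomes JSON true
print("--stage-overrides", shlex.quote(payload))
```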
--async-chunk, --no-async-chunk¶
Override the deploy YAML's async_chunk: bool. Unset leaves the YAML value in force.
Default: None
--stage-id¶
Select and launch a single stage by stage_id.
Default: None
--replica-id¶
Deprecated and ignored — replica ids are auto-assigned by the master server. Specifying this flag prints a warning and has no effect.
Default: None
--stage-init-timeout¶
The timeout for initializing a single stage in seconds (default: 300)
Default: 300
--init-timeout¶
The timeout for initializing the stages.
Default: 600
--shm-threshold-bytes¶
The threshold for the shared memory size.
Default: 65536
--log-stats¶
Enable logging the stats.
Default: False
--log-file¶
The path to the log file.
Default: None
--batch-timeout¶
The timeout for the batch.
Default: 10
--worker-backend¶
Possible choices: multi_process, ray
The backend to use for stage workers.
Default: multi_process
--ray-address¶
The address of the Ray cluster to connect to.
Default: None
--omni-master-address, -oma¶
Hostname or IP address of the Omni orchestrator (master).
Default: None
--omni-master-port, -omp¶
Port of the Omni orchestrator (master).
Default: None
--omni-replica-address, -ora¶
Local bind address (this host's IP) that the headless stage advertises to the Omni master for its handshake/input/output ZMQ sockets. If unset, auto-detected via a UDP-connect routing probe against --omni-master-address. Override only when the auto-detected IP is wrong (e.g. multi-NIC host where the master is reachable on the wrong interface).
Default: None
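A UDP-connect routing probe is a common way to discover which local address the operating system would use to reach a given peer. The sketch below shows the general technique with the standard `socket` module. It is not the vLLM-Omni implementation, `probe_local_ip` is a hypothetical name, and it handles IPv4 only.

```python
import socket


def probe_local_ip(master_address, master_port=26000):
    """Return the local IPv4 address the kernel selects to reach master_address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket sends no packets; it only makes the
        # kernel choose a route and bind a matching source address.
        sock.connect((master_address, master_port))
        return sock.getsockname()[0]


print(probe_local_ip("127.0.0.1"))
```

On a multi-NIC host the address returned by such a probe follows the routing table, which is why an explicit --omni-replica-address may be needed when the route points at the wrong interface.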
--omni-dp-size-local¶
Number of stage replicas this runtime launches locally for its own --stage-id. Process-local: head and every headless invocation read their own copy; values may differ across invocations. Requires --stage-id to be set when not equal to 1.
Default: 1
--omni-lb-policy¶
Possible choices: random, round-robin, least-queue-length
Per-stage load-balancing policy used by the head's StagePool to route requests across UP replicas. Only consulted on the head runtime.
Default: random
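To make the three policy names concrete, here is a minimal sketch of how each one could choose among replicas. It is illustrative only and does not reflect the actual StagePool code; `make_picker` and the `{replica_id: queue_length}` input shape are assumptions.

```python
import itertools
import random


def make_picker(policy):
    """Return a function that picks a replica id from {replica_id: queue_length}."""
    counter = itertools.count()

    def pick(queue_lengths):
        replicas = sorted(queue_lengths)
        if policy == "random":
            return random.choice(replicas)
        if policy == "round-robin":
            return replicas[next(counter) % len(replicas)]
        if policy == "least-queue-length":
            return min(replicas, key=lambda r: queue_lengths[r])
        raise ValueError(f"unknown policy: {policy}")

    return pick


queues = {0: 3, 1: 0, 2: 5}                 # replica id -> pending requests
round_robin = make_picker("round-robin")
print([round_robin(queues) for _ in range(4)])
print(make_picker("least-queue-length")(queues))
```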
--omni-heartbeat-timeout¶
Seconds before an unreporting replica is marked ERROR in the OmniCoordinator. Only consulted on the head runtime.
Default: 30.0
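The timeout check can be pictured as a comparison between the current time and each replica's last report. The sketch below is an assumption about the general idea, not the OmniCoordinator's code; `stale_replicas` is a hypothetical helper, and whether the real comparison is strict is not stated here.

```python
import time


def stale_replicas(last_report, timeout=30.0, now=None):
    """Return ids of replicas whose last report is older than `timeout` seconds."""
    now = time.monotonic() if now is None else now
    return sorted(r for r, seen in last_report.items() if now - seen > timeout)


# Replica 0 reported 5 s ago, replica 1 reported 45 s ago.
print(stale_replicas({0: 100.0, 1: 60.0}, timeout=30.0, now=105.0))
```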
--num-gpus¶
Number of GPUs to use for diffusion model inference.
Default: None
--model-class-name¶
Override the diffusion pipeline class name (e.g. LTX2ImageToVideoPipeline).
Default: None
--diffusion-load-format¶
Possible choices: default, custom_pipeline, dummy, diffusers
How to load the diffusion pipeline: native/registry (default), custom_pipeline, dummy, or diffusers for the HF diffusers adapter.
Default: None
--diffusers-load-kwargs¶
JSON object passed to DiffusionPipeline.from_pretrained(). It overrides the corresponding parameters in the standard vLLM-Omni interface (e.g. '{"use_safetensors": true, "variant": "fp16"}').
Default: {}
--diffusers-call-kwargs¶
JSON object passed to pipeline.__call__(). Useful for model-specific sampling parameters not covered by the vLLM-Omni interface. At request time, it is overridden by the corresponding parameters in the vLLM-Omni interface (e.g. '{"num_inference_steps": 30, "guidance_scale": 7.5}').
Default: {}
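The precedence described for --diffusers-call-kwargs can be pictured as a dictionary merge in which request-time parameters win. This is a sketch under that assumption. `effective_call_kwargs` is a hypothetical helper, and the real merge may treat unset or null request fields differently.

```python
def effective_call_kwargs(call_kwargs, request_params):
    """Server-side call kwargs act as defaults; request parameters override them."""
    return {**call_kwargs, **request_params}


server_defaults = {"num_inference_steps": 30, "guidance_scale": 7.5}
request = {"num_inference_steps": 50}       # supplied with an individual request

print(effective_call_kwargs(server_defaults, request))
```

Here `guidance_scale` keeps the server-side value because the request does not set it, while `num_inference_steps` takes the request value.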
--usp, --ulysses-degree¶
Ulysses Sequence Parallelism degree for diffusion models. Equivalent to setting DiffusionParallelConfig.ulysses_degree.
Default: None
--ulysses-mode¶
Possible choices: strict, advanced_uaa
Ulysses sequence-parallel mode for diffusion models. 'strict' keeps the original divisibility requirements; 'advanced_uaa' enables the experimental UAA path for uneven sequence/head shapes.
Default: strict
--ring, --ring-degree¶
Ring Sequence Parallelism degree for diffusion models. Equivalent to setting DiffusionParallelConfig.ring_degree.
Default: None
--diffusion-quantization-config¶
JSON string for diffusion quantization_config. Example: '{"method":"gguf","gguf_model":"/path/to/model.gguf"}'.
Default: None
--force-cutlass-fp8¶
Diffusion-only runtime override for ModelOpt FP8 checkpoints: force CUTLASS FP8 linear kernels on CUDA SM89+ devices. Ignored for BF16, non-ModelOpt FP8, ROCm, and older CUDA GPUs.
Default: None
--use-hsdp¶
Enable HSDP (Hybrid Sharded Data Parallel) for diffusion models. Shards model weights across GPUs to reduce per-GPU memory usage.
Default: False
--hsdp-shard-size¶
Number of GPUs to shard weights across. -1 = auto (world_size / replicate_size).
Default: -1
--hsdp-replicate-size¶
Number of replica groups for HSDP. Each group holds a full sharded copy.
Default: 1
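The automatic shard size is simple arithmetic, worked through below. `resolve_hsdp_shard_size` is a hypothetical helper, and the divisibility check is an assumption about what a sensible configuration requires rather than documented behavior.

```python
def resolve_hsdp_shard_size(world_size, replicate_size=1, shard_size=-1):
    """Resolve --hsdp-shard-size, where -1 means world_size / replicate_size."""
    if shard_size != -1:
        return shard_size                   # explicit value is used as given
    if world_size % replicate_size != 0:
        raise ValueError("world_size must be divisible by replicate_size")
    return world_size // replicate_size


# 8 GPUs split into 2 replica groups: each group shards weights across 4 GPUs.
print(resolve_hsdp_shard_size(8, replicate_size=2))
```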
--diffusion-attention-backend¶
Diffusion attention backend (shorthand). Sets the default backend for all diffusion attention roles, e.g. 'FLASH_ATTN'. May be combined with --diffusion-attention-config.per_role.* overrides, but mutually exclusive with --diffusion-attention-config.default.backend.
Default: None
--diffusion-attention-config, -dac¶
Diffusion attention config. Accepts JSON or vLLM-style dotted flags. Examples: --diffusion-attention-config.default.backend FLASH_ATTN, --diffusion-attention-config.per_role.self.backend SPARSE_BLOCK, --diffusion-attention-config.per_role.cross.backend SAGE_ATTN, --diffusion-attention-config '{"default": {"backend": "FLASH_ATTN"}, "per_role": {"cross": {"backend": "SAGE_ATTN"}}}'.
Default: None
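The JSON form and the dotted-flag form carry the same nested structure. The sketch below flattens the JSON example above into the equivalent dotted flags. `to_dotted_flags` is a hypothetical helper written for illustration and is not part of vLLM-Omni.

```python
def to_dotted_flags(config, prefix="--diffusion-attention-config"):
    """Flatten a nested config dict into dotted CLI flags and their values."""
    flags = []
    for key, value in config.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            flags += to_dotted_flags(value, name)   # descend into nested sections
        else:
            flags += [name, str(value)]
    return flags


config = {
    "default": {"backend": "FLASH_ATTN"},
    "per_role": {"cross": {"backend": "SAGE_ATTN"}},
}
for flag, value in zip(*[iter(to_dotted_flags(config))] * 2):
    print(flag, value)
```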
--cache-backend¶
Cache backend for diffusion models, options: 'tea_cache', 'cache_dit', 'mag_cache'
Default: none
--cache-config¶
JSON string of cache configuration. TeaCache: '{"rel_l1_thresh": 0.2}'. MagCache: '{"mag_threshold": 0.24, "mag_max_skip_steps": 5, "mag_retention_ratio": 0.1}'. Calibration mode: add '"mag_calibrate": true'
Default: None
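As a sketch of how the MagCache example and the calibration switch fit together, the snippet below builds the --cache-config value from a dictionary. It assumes that `mag_calibrate` is added alongside the other MagCache keys, which is how the description above reads but is not confirmed here.

```python
import json
import shlex

mag_cache = {
    "mag_threshold": 0.24,
    "mag_max_skip_steps": 5,
    "mag_retention_ratio": 0.1,
}

# Calibration mode: the same config with one extra key (assumed placement).
calibration = {**mag_cache, "mag_calibrate": True}

argv = ["--cache-backend", "mag_cache", "--cache-config", json.dumps(calibration)]
print(shlex.join(argv))
```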
--enable-cache-dit-summary¶
Enable cache-dit summary logging after diffusion forward passes.
Default: False
--step-execution¶
Enable per-step diffusion execution so running requests can be aborted between denoise steps.
Default: False
--vae-use-slicing¶
Enable VAE slicing for memory optimization (useful for mitigating OOM issues).
Default: False
--vae-use-tiling¶
Enable VAE tiling for memory optimization (useful for mitigating OOM issues).
Default: False
--disable-multithread-weight-load¶
Disable multi-threaded safetensors loading (default: enabled with 4 threads).
Default: True
--num-weight-load-threads¶
Number of threads for parallel weight loading (default: 4).
Default: 4
--enable-cpu-offload¶
Enable CPU offloading for diffusion models.
Default: False
--enable-layerwise-offload¶
Enable layerwise (blockwise) offloading on DiT modules.
Default: False
--boundary-ratio¶
Boundary split ratio for low/high DiT in video models (e.g., 0.875 for Wan2.2).
Default: None
--flow-shift¶
Scheduler flow_shift for video models (e.g., 5.0 for 720p, 12.0 for 480p).
Default: None
--diffusion-kv-cache-dtype¶
Diffusion attention KV cache dtype (e.g. fp8). Separate from vLLM --kv-cache-dtype.
Default: None
--diffusion-kv-cache-skip-steps¶
Diffusion KV-cache quantization skip-step selector, e.g. '0-9,20,25-30'.
Default: None
--diffusion-kv-cache-skip-layers¶
Diffusion KV-cache quantization skip-layer selector, e.g. '0,1,4-8'.
Default: None
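Both skip selectors use the same compact syntax of comma-separated indices and ranges. The sketch below expands such a selector into explicit indices. `parse_selector` is a hypothetical helper, and treating ranges as inclusive on both ends is an assumption.

```python
def parse_selector(selector):
    """Expand a selector such as '0-9,20,25-30' into a sorted list of indices."""
    indices = set()
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            indices.update(range(start, end + 1))   # assumed inclusive range
        else:
            indices.add(int(part))
    return sorted(indices)


steps = parse_selector("0-9,20,25-30")
layers = parse_selector("0,1,4-8")
print(steps)
print(layers)
```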
--cfg-parallel-size¶
Possible choices: 1, 2
Number of devices for CFG parallel computation for diffusion models. Equivalent to setting DiffusionParallelConfig.cfg_parallel_size.
Default: 1
--vae-patch-parallel-size¶
VAE Patch Parallelism degree for diffusion models. Distributes VAE decode workload across multiple ranks by splitting the latent spatially. Equivalent to setting DiffusionParallelConfig.vae_patch_parallel_size.
Default: 1
--default-sampling-params¶
JSON string of default sampling parameters. Structure: {"
Default: None
--max-generated-image-size¶
The maximum size of a generated image (height * width).
Default: None
--tts-max-instructions-length¶
Maximum length for TTS voice style instructions (overrides stage config, default: 500).
Default: None
--no-guardrails¶
Disable Cosmos3 text/video safety guardrails for this server.
Default: False
--enable-diffusion-pipeline-profiler¶
Enable diffusion pipeline profiler to display stage durations.
Default: False
--enable-ar-profiler¶
Enable AR stage profiler to include AR stage timing in stage_durations.
Default: False
--auxiliary-text-encoder¶
Model name or path for the auxiliary text encoder parameters (especially for Hidream-l1-full).
Default: None