Stage configs for vLLM-Omni¶

In vLLM-Omni, the target model is separated into multiple stages, which are processed by different LLMEngines, DiffusionEngines or other types of engines. Depending on different types of stages, such as Autoregressive (AR) stage or Diffusion transformer (DiT) stage, each can choose corresponding schedulers, model workers to load with the Engines in a plug-in fashion.

Note

Default deploy config YAMLs (for example, vllm_omni/deploy/qwen2_5_omni.yaml, vllm_omni/deploy/qwen3_omni_moe.yaml, and vllm_omni/deploy/qwen3_tts.yaml) are bundled and loaded automatically when neither --stage-configs-path nor --deploy-config is provided — the model registry resolves the right pipeline + deploy YAML by model_type. The bundled defaults have been verified on 1xH100 for Qwen2.5-Omni and 2xH100 for Qwen3-Omni. Models that have not yet migrated to the new schema continue to use the legacy vllm_omni/model_executor/stage_configs/<model>.yaml files via --stage-configs-path.

New deploy schema reference¶

The new deploy schema lives under vllm_omni/deploy/ and is paired with a frozen PipelineConfig registered by the model's pipeline.py. Each deploy YAML has these top-level fields:

Field	Type	Required	Default	Description
`base_config`	str (path)	optional	—	Overlay parent (relative or absolute). `stages:` / `platforms:` deep-merged by stage_id; other scalars overlay-wins. Intended for user-authored overlays; prod yamls stay flat.
`async_chunk`	bool	optional	`true`	Enable chunked streaming between stages. Pin to `false` if the pipeline runs end-to-end.
`connectors`	dict	optional	`null`	Named connector specs (`{name, extra}`). Referenced by each stage's `input_connectors` / `output_connectors`. See Connector schema.
`edges`	list	optional	`null`	Explicit edge list for the KV transfer graph. Auto-derived from stage inputs if omitted.
`stages`	list	required	—	Per-stage engine args + wiring (see Stage fields).
`platforms`	dict	optional	`null`	Keyed by `npu` / `rocm` / `xpu`, each contains a `stages:` list with per-platform overrides applied on top of the CUDA defaults.
`pipeline`	str	optional	`null`	Override the auto-detected pipeline registry key (used for structural variants like `qwen2_5_omni_thinker_only`).
`trust_remote_code`	bool	optional	`true`	Pipeline-wide. Trust HF remote code on model load; applies to every stage.
`distributed_executor_backend`	str \| null	optional	`null`	Pipeline-wide. Distributed executor backend forwarded to vLLM (`"mp"`, `"ray"`, `"external_launcher"`). If omitted, vLLM auto-selects backend from runtime topology.
`dtype`	str \| null	optional	`null`	Pipeline-wide. Model dtype for every stage.
`quantization`	str \| null	optional	`null`	Pipeline-wide. Quantization method for every stage.
`enable_prefix_caching`	bool	optional	`false`	Pipeline-wide. Prefix cache toggle applied to every stage.
`enable_chunked_prefill`	bool \| null	optional	`null`	Pipeline-wide. Chunked prefill toggle applied to every stage.
`data_parallel_size`	int	optional	`1`	Pipeline-wide. DP degree for every stage.
`pipeline_parallel_size`	int	optional	`1`	Pipeline-wide. PP degree for every stage.

Note: for diffusion path, distributed_executor_backend currently defaults to mp, and ray / external_launcher are not fully supported yet.

Stage fields¶

Each entry under stages: accepts any StageDeployConfig field directly (no nested engine_args:). Only fields whose value legitimately varies across stages live here; pipeline-wide settings (trust_remote_code, distributed_executor_backend, dtype, quantization, prefix/chunked prefill, DP/PP sizes) are declared at the top level and applied to every stage. Unknown keys fall through to engine_extras: and are forwarded to the engine.

Field	Type	Required	Default	Description
`stage_id`	int	required	—	Stage identity; matched against `PipelineConfig.stages[*].stage_id`.
`max_num_seqs`	int	optional	`64`	Max concurrent sequences per stage.
`gpu_memory_utilization`	float	optional	`0.9`	Per-stage memory budget.
`tensor_parallel_size`	int	optional	`1`	TP degree for this stage.
`enforce_eager`	bool	optional	`false`	Disable CUDA graphs.
`max_num_batched_tokens`	int	optional	`32768`	Prefill budget.
`max_model_len`	int \| null	optional	`null`	Per-stage context length (auto-sets `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` when larger than HF default).
`async_scheduling`	bool \| null	optional	`null`	Per-stage async scheduling toggle.
`devices`	str	optional	`"0"`	`CUDA_VISIBLE_DEVICES`-style device list.
`output_connectors`	dict \| null	optional	`null`	Keyed by `to_stage_<n>`; values are names registered under top-level `connectors:`.
`input_connectors`	dict \| null	optional	`null`	Keyed by `from_stage_<n>`; values are names registered under top-level `connectors:`.
`default_sampling_params`	dict \| null	optional	`null`	Baseline sampling params. Deep-merged with pipeline `sampling_constraints` (pipeline wins).
`engine_extras`	dict	optional	`{}`	Catch-all for keys not listed above; deep-merged across overlays. Also carries per-stage overrides of pipeline-wide settings (e.g. stage-specific `dtype`).

Connector schema¶

Each entry under top-level connectors: follows this shape:

connectors:
  <connector_name>:
    name: <ConnectorClassName>     # required — class registered in vllm_omni.distributed
    extra:                         # optional — forwarded to the connector's __init__
      <key>: <value>
      ...

Connector class	Use case	`extra` keys
`SharedMemoryConnector`	Same-host KV transfer between stages (default for bundled YAMLs).	`shm_threshold_bytes` (int, default `65536`).
`MooncakeStoreConnector`	Cross-host KV transfer over TCP. Required for multi-node deployments.	`host`, `metadata_server`, `master`, `segment` (int bytes), `localbuf` (int bytes), `proto` (`"tcp"` / `"rdma"`).

A stage references a connector by name in its input_connectors / output_connectors:

connectors:
  shm:
    name: SharedMemoryConnector

stages:
  - stage_id: 0
    output_connectors: {to_stage_1: shm}
  - stage_id: 1
    input_connectors:  {from_stage_0: shm}

CLI flags introduced in this refactor¶

Flag	Description
`--deploy-config PATH`	Load a new-schema deploy YAML. Takes precedence over `--stage-configs-path`. Optional — when omitted, the bundled `vllm_omni/deploy/<model_type>.yaml` is auto-loaded by the model registry.
`--stage-overrides JSON`	Per-stage JSON overrides, e.g. `'{"0":{"gpu_memory_utilization":0.5}}'`. Per-stage values always win over global flags.
`--async-chunk` / `--no-async-chunk`	Flip the deploy YAML's `async_chunk:` bool. Unset (default) leaves the YAML value in force.
`--stage-configs-path`	Deprecated. Accepts legacy `stage_args` yamls and (auto-detected) new deploy yamls; emits a deprecation warning. Migrate to `--deploy-config`. To be removed in a follow-up PR.

Stage-Based CLI Paradigm¶

The stage-based CLI paradigm facilitates the execution of discrete pipeline stages within isolated processes:

Stage 0 typically encapsulates the orchestrator and the primary API server. Invocation requires --stage-id 0, --omni-master-address, --omni-master-port, and standard port declarations (e.g., --port).
Worker Stages operate without a distinct API server (i.e., using --headless), are assigned sequential --stage-id identifiers, and must reference the corresponding --omni-master-address and --omni-master-port parameters to successfully register with Stage 0.

For migrated architectures, the system automatically resolves and loads the bundled deployment YAML. Consequently, the primary execution path does not necessitate the explicit definition of --deploy-config: the example below uses CUDA_VISIBLE_DEVICES=0 for Stage 0 and CUDA_VISIBLE_DEVICES=1 for Stage 1.

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --port 8091 \
    --stage-id 0 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

When instantiating a custom deployment YAML conforming to the updated schema, append the --deploy-config /path/to/override.yaml directive to all node invocations. For legacy architectures (e.g., BAGEL) configured via deprecated stage_args: schemas, continue to specify the relevant configuration via --stage-configs-path /path/to/config.yaml.

In the context of standard initialization architectures, utilizing the --stage-overrides parameter operates as the optimal methodology for delineating stage-specific tuning from the CLI interface:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'

Conversely, in the context of the stage-based CLI paradigm, given that each execution process exclusively instantiates a single pipeline stage, configuration override attributes can be defined uniformly via explicit CLI flags on the corresponding instantiation command, rendering composite --stage-overrides JSON strings unnecessary:

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 1 \
    --headless \
    --gpu-memory-utilization 0.5 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

Precedence¶

From highest to lowest:

Per-stage flags (--stage-overrides JSON, --stage-<id>-<key> if registered)
Explicit global CLI flags (--gpu-memory-utilization 0.85, etc.)
Platform section (platforms.npu.stages, etc.) on top of the base stages:
Overlay YAML (via base_config:) on top of the base YAML
Parser defaults

Worked override example¶

Starting from the bundled vllm_omni/deploy/qwen3_omni_moe.yaml:

# vllm_omni/deploy/qwen3_omni_moe.yaml (excerpt)
async_chunk: true
stages:
  - stage_id: 0
    gpu_memory_utilization: 0.9
    max_num_seqs: 32
  - stage_id: 1
    gpu_memory_utilization: 0.7
    max_num_seqs: 16

A user-authored overlay that inherits the base and overrides only stage 1:

# my_overrides.yaml
base_config: /path/to/vllm_omni/deploy/qwen3_omni_moe.yaml
stages:
  - stage_id: 1
    gpu_memory_utilization: 0.5     # smaller GPU

Launched with both an explicit global flag and a per-stage override:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --deploy-config my_overrides.yaml \
    --max-model-len 16384 \
    --stage-overrides '{"0": {"max_num_seqs": 8}}'

Within the stage-based CLI paradigm, equivalent configuration parameters can inherently be passed directly as command-line arguments to the designated single-stage process instantiation:

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
    --stage-id 0 \
    --max-num-seqs 8 \
    --omni-master-address 127.0.0.1 \
    --omni-master-port 26000

Effective config per stage after the merge:

Stage	Field	Final value	Source
0	`gpu_memory_utilization`	`0.9`	base YAML (overlay didn't touch stage 0)
0	`max_num_seqs`	`8`	per-stage CLI (`--stage-overrides`) — wins over base `32`
0	`max_model_len`	`16384`	global CLI
1	`gpu_memory_utilization`	`0.5`	overlay YAML — wins over base `0.7`
1	`max_num_seqs`	`16`	base YAML (overlay didn't touch this field)
1	`max_model_len`	`16384`	global CLI
2	(all defaults)	—	base YAML (no overrides apply)

Therefore, as a core part of vLLM-Omni, the stage configs for a model have several main functions:

Claim partition of stages and their corresponding class implementation in model_executor/models.
The disaggregated configuration for each stage and the communication topology among them.
Engine arguments for each engine within the stage.
Input and output dependencies for each stage.
Default input parameters.

To override specific parameters, explicitly inject the customized configuration schema in both online and offline instantiation flows. Prioritize the --deploy-config flag when loading the new-schema deploy YAML schemas, reserving the --stage-configs-path parameter exclusively to maintain compatibility with legacy stage_args YAML constructs.

Examples:

For offline (Assume necessary dependencies have been imported):

model_name = "Qwen/Qwen2.5-Omni-7B"
omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")

For online serving:

vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --deploy-config /path/to/deploy_config.yaml

Legacy online serving:

vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file

Important

We are actively iterating on the definition of stage configs, and we welcome all feedbacks from both community users and developers to help us shape the development!

Below is a specific example of stage_configs.yaml in Qwen2.5-omni.

# stage config for running qwen2.5-omni with AsyncOmniEngine + Orchestrator runtime.
stage_args:
  - stage_id: 0 # mark the unique id for each stage
    runtime: # The disaggregated configuration
      process: true  # Run this stage in a separate process
      devices: "0" # Logical device index for this stage (mapped through CUDA_VISIBLE_DEVICES / ASCEND_RT_VISIBLE_DEVICES if set)
    engine_args: # Engine arguments for a certain engine
      model_stage: thinker
      max_num_seqs: 1
      model_arch: Qwen2_5OmniForConditionalGeneration # The model implementation registered in model_executor/models/registry.py
      worker_type: ar # The specific worker used
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler # The specific scehduler used
      gpu_memory_utilization: 0.8 # The gpu memory allocation for the stage within a single chip
      enforce_eager: true  # Now we only support eager mode
      trust_remote_code: true # Needed by huggingface config parsing
      engine_output_type: latent  # It claims that the stage will input latent hiddenstates besides token ids
      enable_prefix_caching: false # For request with hiddenstates output, the prefix caching is not supported now
    is_comprehension: true # If the stage is a text or multimodal comprehension module. If it is, the AsyncOmni will use its tokenizer as default
    final_output: true # If the stage has output as part of final outputs. If it is false, which means that the stage only works as a intermediate role.
    final_output_type: text # What is the final output type. It can be text and audio now.
    default_sampling_params: # sampling parameters for the stage. Their meaning aligns with vLLM.
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.1
  - stage_id: 1
    runtime:
      process: true
      devices: "1"
    engine_args:
      model_stage: talker
      max_num_seqs: 3
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.8
      enforce_eager: true
      trust_remote_code: true
      enable_prefix_caching: false
      engine_output_type: latent
    engine_input_source: [0]
    default_sampling_params:
      temperature: 0.9
      top_p: 0.8
      top_k: 40
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
      stop_token_ids: [8294]
  - stage_id: 2
    runtime:
      process: true
      devices: "0"            # Example: use a different GPU than the previous stage; use "0" if single GPU
    engine_args:
      model_stage: code2wav
      max_num_seqs: 1
      model_arch: Qwen2_5OmniForConditionalGeneration
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      gpu_memory_utilization: 0.15
      enforce_eager: true
      trust_remote_code: true
      enable_prefix_caching: false
      engine_output_type: audio
    engine_input_source: [1]
    final_output: true
    final_output_type: audio
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.1

# Top-level runtime config (concise): default windows and stage edges
runtime:
  enabled: true

  edges:
    - from: 0                   # thinker → talker: trigger only after receiving full input (-1)
      to: 1
    - from: 1                   # talker → code2wav: trigger only after receiving full input (-1)
      to: 2

Stage Configuration Arguments¶

Each stage in the stage_args list contains the following configuration options:

`stage_id`¶

A unique identifier for each stage in the multi-stage pipeline. Stages are numbered sequentially starting from 0, and this ID is used to reference stages in inter-stage dependencies (e.g., engine_input_source).

`prompt_expand_func` (Optional)¶

A custom Python function hook for the LLM stage (Stage 0) that expands a single incoming prompt object into multiple prompts. This is primarily used for multi-modal Classifier-Free Guidance (CFG), where it generates the necessary companion requests (like a negative text prompt) and tags them with internal roles (e.g., cfg_text). This ensures the upstream LLM generates the needed contextual hidden states for both the conditional and unconditional generations simultaneously.

`cfg_kv_collect_func` (Optional)¶

A custom Python function hook for downstream diffusion stages (Stage 1+) to collect, map, and process the KV caches transferred from the companion requests fired by prompt_expand_func. It aggregates the hidden condition states cleanly (e.g., binding them as cfg_text_past_key_values and cfg_text_kv_metadata), allowing the diffusion runtime to perform CFG smoothly without redundantly evaluating text paths on the DiT workers.

`runtime`¶

Configuration for disaggregated execution of the stage, controlling how the stage is deployed and executed.

`runtime.process`¶

Whether to run this stage in a separate process. When set to true, the stage will be executed in an isolated process, enabling better resource isolation and parallel execution across different stages. This is essential for multi-GPU deployments where different stages run on different devices.

Default: true

`runtime.devices`¶

Logical device indices for this stage, specified as a string. Values are logical indices (0, 1, 2, ...) — not physical GPU IDs — and are mapped through the platform's visibility env var (CUDA_VISIBLE_DEVICES on CUDA, ASCEND_RT_VISIBLE_DEVICES on NPU) before being applied via torch.cuda.set_device() (or the equivalent).

Example: if CUDA_VISIBLE_DEVICES=0,2,4 is set in the environment, then devices: "0" selects physical GPU 0 (the first visible), devices: "1" selects physical GPU 2, and devices: "0,1" makes physical GPUs 0 and 2 available to the stage. If no visibility env var is set, logical and physical IDs coincide.

Default: "0"

`engine_args.max_num_seqs`¶

The maximum number of sequences for concurrent processing in this stage. For LLM stages, this controls the vLLM scheduler's maximum concurrent sequences. For all stage types, this also controls how many tasks can be batched together in the task processing loop.

Default: 1

`engine_args.request_batch_max_wait_ms`¶

The maximum time, in milliseconds, that a diffusion request-mode stage may wait before the first schedule() of a new scheduler wave so compatible requests can accumulate for request-level batching. This only applies to diffusion pipelines that support request-level batching with step_execution disabled.

Use this together with max_num_seqs > 1 for bursty serving traffic. 0 disables admission waiting and preserves the lowest first-request latency. For diffusion request-level batching tuning, see Request-Level Batching.

Default: 0.0

`engine_args`¶

Engine arguments for configuring the LLM engine, diffusion engine, or other engine types used by this stage.

`engine_args.model_stage`¶

The name identifier for this model stage within the multi-stage architecture. This is used internally to distinguish different stages of the same model (e.g., "thinker", "talker", "code2wav" in Qwen2.5-Omni).

`engine_args.model_arch`¶

The model architecture class name that is registered in model_executor/models/registry.py. This specifies which model implementation to use for this stage. The class must be registered in the model registry for vLLM-Omni to locate and instantiate it.

`engine_args.worker_cls`¶

The specific worker class to use for this stage. This determines how the model computations are executed. Examples include vllm_omni.worker.gpu_ar_worker.GPUARWorker for autoregressive stages and vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker for diffusion-based stages.

`engine_args.scheduler_cls`¶

The scheduler class to use for this stage. The scheduler manages request queuing, batching, and execution order. Examples include vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler for standard stages and vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler for diffusion stages.

`engine_args.gpu_memory_utilization`¶

The fraction of GPU memory to allocate for this stage within a single GPU chip. This is a value between 0.0 and 1.0, where 0.8 means 80% of the GPU memory will be used by this stage. This allows fine-grained control over memory allocation when multiple stages share the same GPU or when reserving memory for other operations.

Default: 0.8

Memory Configuration Guide

For detailed information on how to calculate memory requirements and properly configure gpu_memory_utilization, see the GPU Memory Calculation and Configuration Guide.

`engine_args.enforce_eager`¶

Whether to enforce eager execution mode. When set to true, the engine will run in eager mode without using CUDA graphs or other compilation optimizations. Currently, vLLM-Omni only supports eager mode.

Default: true

`engine_args.trust_remote_code`¶

Whether to trust remote code when loading models from Hugging Face. This is required for models that use custom code in their configuration files. Set to true when loading models that require custom model implementations.

Default: true

`engine_args.engine_output_type`¶

Specifies the type of output produced by this stage's engine. This determines what kind of data flows to downstream stages. Possible values include latent (hidden states), text (tokenized text), and audio (audio waveforms). When set to latent, the stage outputs latent hidden states in addition to token IDs, which are consumed by downstream stages.

Default: latent

`engine_args.enable_prefix_caching`¶

Whether to enable prefix caching for this stage. Prefix caching can improve performance by caching KV cache for common prompt prefixes. However, for requests that output hidden states (when engine_output_type is latent), prefix caching is not currently supported and should be set to false.

Default: false

`is_comprehension`¶

Whether this stage is a text or multimodal comprehension module. When set to true, the stage acts as a comprehension module that processes input text or multimodal content. If this is the first comprehension stage, AsyncOmni will use its tokenizer as the default tokenizer for the entire pipeline.

Default: true

`final_output`¶

Whether this stage produces output that is part of the final outputs returned to the user. When set to false, the stage only works as an intermediate stage, processing data that flows to downstream stages but not contributing directly to the final response.

Default: true

`final_output_type`¶

The type of final output produced by this stage. This specifies what format the output will be in when returned to the user. Currently supported values are text (for text generation) and audio (for audio generation).

Default: text

`default_sampling_params`¶

Default sampling parameters for this stage. These parameters control the generation behavior and align with vLLM's sampling parameter semantics. These defaults are used when no explicit sampling parameters are provided in the request.

`default_sampling_params.temperature`¶

Sampling temperature for controlling randomness. Lower values (e.g., 0.0) make the output more deterministic and focused, while higher values increase randomness.

Default: 0.0

`default_sampling_params.top_p`¶

Nucleus sampling parameter. Only tokens with cumulative probability mass up to top_p are considered. This helps filter out low-probability tokens while maintaining diversity.

Default: 1.0

`default_sampling_params.top_k`¶

Top-k sampling parameter. Only the top k most likely tokens are considered. Set to -1 to disable top-k filtering and consider all tokens.

Default: -1

`default_sampling_params.max_tokens`¶

Maximum number of tokens to generate in this stage. This limits the length of the output sequence.

Default: 2048

`default_sampling_params.seed`¶

Random seed for reproducible generation. When set, the random number generator will be initialized with this seed to ensure consistent outputs across runs.

Default: 42

`default_sampling_params.detokenize`¶

Whether to detokenize the output tokens into text. When set to true, token IDs are converted back to readable text strings.

Default: True

`default_sampling_params.repetition_penalty`¶

Penalty applied to tokens that have already appeared in the generated sequence. Values greater than 1.0 discourage repetition, while values less than 1.0 encourage it. A value of 1.0 applies no penalty.

Default: 1.1

`tts_args` (TTS stages only)¶

Configuration for Text-to-Speech specific parameters. This section is only applicable to TTS model stages (e.g., qwen3_tts).

`tts_args.max_instructions_length`¶

Maximum character length for voice style/emotion instructions. Instructions exceeding this limit will be rejected with a validation error.

Default: 500

This value can be overridden at runtime using the --tts-max-instructions-length CLI parameter when starting the server.

Stage configs for vLLM-Omni¶

New deploy schema reference¶

Stage fields¶

Connector schema¶

CLI flags introduced in this refactor¶

Stage-Based CLI Paradigm¶

Precedence¶

Worked override example¶

Stage Configuration Arguments¶

stage_id¶

prompt_expand_func (Optional)¶

cfg_kv_collect_func (Optional)¶

runtime¶

runtime.process¶

runtime.devices¶

engine_args.max_num_seqs¶

engine_args.request_batch_max_wait_ms¶

engine_args¶

engine_args.model_stage¶

engine_args.model_arch¶

engine_args.worker_cls¶

engine_args.scheduler_cls¶

engine_args.gpu_memory_utilization¶

engine_args.enforce_eager¶

engine_args.trust_remote_code¶

engine_args.engine_output_type¶

engine_args.enable_prefix_caching¶

is_comprehension¶

final_output¶

final_output_type¶

default_sampling_params¶

default_sampling_params.temperature¶

default_sampling_params.top_p¶

default_sampling_params.top_k¶

default_sampling_params.max_tokens¶

default_sampling_params.seed¶

default_sampling_params.detokenize¶

default_sampling_params.repetition_penalty¶

tts_args (TTS stages only)¶

tts_args.max_instructions_length¶

`stage_id`¶

`prompt_expand_func` (Optional)¶

`cfg_kv_collect_func` (Optional)¶

`runtime`¶

`runtime.process`¶

`runtime.devices`¶

`engine_args.max_num_seqs`¶

`engine_args.request_batch_max_wait_ms`¶

`engine_args`¶

`engine_args.model_stage`¶

`engine_args.model_arch`¶

`engine_args.worker_cls`¶

`engine_args.scheduler_cls`¶

`engine_args.gpu_memory_utilization`¶

`engine_args.enforce_eager`¶

`engine_args.trust_remote_code`¶

`engine_args.engine_output_type`¶

`engine_args.enable_prefix_caching`¶

`is_comprehension`¶

`final_output`¶

`final_output_type`¶

`default_sampling_params`¶

`default_sampling_params.temperature`¶

`default_sampling_params.top_p`¶

`default_sampling_params.top_k`¶

`default_sampling_params.max_tokens`¶

`default_sampling_params.seed`¶

`default_sampling_params.detokenize`¶

`default_sampling_params.repetition_penalty`¶

`tts_args` (TTS stages only)¶

`tts_args.max_instructions_length`¶