vllm_omni.model_executor.models.step_audio2 ¶

Modules:

Name	Description
`step_audio2`
`step_audio2_constants`	Step-Audio2 configuration constants - Single Source of Truth.
`step_audio2_thinker`	Step-Audio2 Thinker - Stage 1 LLM for Audio Understanding
`step_audio2_token2wav`

StepAudio2ForConditionalGeneration ¶

Bases: Module, SupportsMultiModal, SupportsPP

Step-Audio2 Main Controller

Manages the two-stage inference pipeline:

- Stage 0 (Thinker): audio understanding and token generation
- Stage 1 (Token2Wav): audio token to waveform synthesis

Usage

Stage 0: Thinker¶

model = StepAudio2ForConditionalGeneration(
    vllm_config=config,
    model_stage="thinker",
)

Stage 1: Token2Wav¶

model = StepAudio2ForConditionalGeneration(
    vllm_config=config,
    model_stage="token2wav",
)

config `instance-attribute` ¶

config = config

have_multimodal_outputs `instance-attribute` ¶

have_multimodal_outputs = True

make_empty_intermediate_tensors `instance-attribute` ¶

make_empty_intermediate_tensors = (
    self.thinker.make_empty_intermediate_tensors
    if self.model_stage == "thinker"
    else (lambda: None)
)
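The conditional above selects a factory by stage: the thinker's own factory in the thinker stage, and a no-op that returns `None` otherwise. A minimal standalone sketch of that selection, assuming the thinker factory is any callable (the helper name `pick_intermediate_tensor_factory` is hypothetical and not part of this module):

```python
from typing import Any, Callable


def pick_intermediate_tensor_factory(
    model_stage: str,
    thinker_factory: Callable[..., Any],
) -> Callable[..., Any]:
    """Return the thinker's factory in the thinker stage, else a no-op."""
    if model_stage == "thinker":
        return thinker_factory
    # Mirrors the `lambda: None` fallback shown above: takes no arguments
    # and produces no intermediate tensors.
    return lambda: None
```

Note that the fallback takes no arguments, so a caller that passes arguments to it in a non-thinker stage would raise a `TypeError`.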

model `instance-attribute` ¶

model = self.thinker

model_stage `instance-attribute` ¶

model_stage = (
    "thinker"
    if raw_model_stage in ("thinker", "step_audio2_thinker")
    else raw_model_stage
)
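The expression above normalizes the configured stage name: both `"thinker"` and `"step_audio2_thinker"` map to `"thinker"`, and any other value passes through unchanged. A minimal sketch of the same rule as a standalone function, assuming `raw_model_stage` is a plain string (the helper name `normalize_model_stage` is hypothetical):

```python
def normalize_model_stage(raw_model_stage: str) -> str:
    """Map the accepted thinker aliases to the canonical name "thinker"."""
    if raw_model_stage in ("thinker", "step_audio2_thinker"):
        return "thinker"
    # Any other value (for example "token2wav") is returned unchanged.
    return raw_model_stage
```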

multimodal_config `instance-attribute` ¶

multimodal_config = multimodal_config

thinker `instance-attribute` ¶

thinker = init_vllm_registered_model(
    vllm_config=vllm_config,
    prefix=maybe_prefix(prefix, "thinker"),
    hf_config=config,
    architectures=[
        "StepAudio2ThinkerForConditionalGeneration"
    ],
)

token2wav `instance-attribute` ¶

token2wav = None

vllm_config `instance-attribute` ¶

vllm_config = vllm_config

compute_logits ¶

compute_logits(hidden_states: Tensor) -> Tensor | None

Compute logits from hidden states

embed_input_ids ¶

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    is_multimodal=None,
) -> Tensor

Explicit vLLM embedding hook for both stages.

embed_multimodal ¶

embed_multimodal(**kwargs)

Delegate multimodal embedding to thinker stage only.

forward ¶

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs,
)

Forward pass through the model

- For Thinker: returns hidden states/logits
- For Token2Wav: returns waveform
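The stage-dependent return value can be pictured as a dispatch on `model_stage`. The following is a toy sketch under stated assumptions, not the module's implementation: `StageDispatcher` is a hypothetical stand-in class, and the two stages are modeled as plain callables rather than real submodules.

```python
from typing import Any, Callable, Optional


class StageDispatcher:
    """Toy stand-in for a two-stage controller (hypothetical)."""

    def __init__(
        self,
        model_stage: str,
        thinker: Optional[Callable[..., Any]] = None,
        token2wav: Optional[Callable[..., Any]] = None,
    ) -> None:
        self.model_stage = model_stage
        self.thinker = thinker
        self.token2wav = token2wav

    def forward(self, input_ids: Any, **kwargs: Any) -> Any:
        # Thinker stage: the result stands in for hidden states/logits.
        if self.model_stage == "thinker":
            if self.thinker is None:
                raise RuntimeError("thinker stage requested but not initialized")
            return self.thinker(input_ids, **kwargs)
        # Token2Wav stage: the result stands in for a waveform.
        if self.model_stage == "token2wav":
            if self.token2wav is None:
                raise RuntimeError("token2wav stage requested but not initialized")
            return self.token2wav(input_ids, **kwargs)
        raise ValueError(f"unknown model_stage: {self.model_stage!r}")
```

Raising on an uninitialized stage is a design choice of this sketch; the source does not state how the real class handles that case.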

get_input_embeddings ¶

get_input_embeddings(input_ids: Tensor) -> Tensor

Compatibility helper used by older call sites.

get_language_model ¶

get_language_model() -> Module

Get the underlying language model.

get_multimodal_embeddings ¶

get_multimodal_embeddings(**kwargs)

Get multimodal embeddings - only used in Thinker stage.

get_placeholder_str `classmethod` ¶

get_placeholder_str(modality: str, i: int) -> str | None

Get placeholder string for a modality

Returns:

Type	Description
`str \| None`	For audio: "" (matches processor's audio_token)

load_weights ¶

load_weights(weights)

Load weights

move_submodules_to_devices ¶

move_submodules_to_devices(
    *,
    thinker_device: str | device | None = None,
    token2wav_device: str | device | None = None,
) -> None

Optionally move thinker/token2wav to different devices

Example

model.move_submodules_to_devices(
    thinker_device='cuda:0',
    token2wav_device='cuda:1',
)
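Both parameters are keyword-only and default to `None`. One plausible reading, sketched below as an assumption rather than the documented behavior, is that `None` leaves the corresponding submodule where it is. `FakeModule` and `move_submodules` are hypothetical names used only for illustration; the stub's `.to()` imitates the familiar module-style call without requiring any framework.

```python
from typing import Optional


class FakeModule:
    """Hypothetical stand-in for a submodule; records its current device."""

    def __init__(self, device: str = "cpu") -> None:
        self.device = device

    def to(self, device: str) -> "FakeModule":
        self.device = device
        return self


def move_submodules(
    thinker: Optional[FakeModule],
    token2wav: Optional[FakeModule],
    *,
    thinker_device: Optional[str] = None,
    token2wav_device: Optional[str] = None,
) -> None:
    # Assumption: a device of None means "do not move this submodule".
    if thinker is not None and thinker_device is not None:
        thinker.to(thinker_device)
    # token2wav may itself be None (it is initialized to None above).
    if token2wav is not None and token2wav_device is not None:
        token2wav.to(token2wav_device)
```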
