vllm_omni.model_executor.models.step_audio2

Modules:

| Name | Description |
| --- | --- |
| step_audio2 | |
| step_audio2_constants | Step-Audio2 configuration constants - Single Source of Truth. |
| step_audio2_thinker | Step-Audio2 Thinker - Stage 1 LLM for Audio Understanding |
| step_audio2_token2wav | |

StepAudio2ForConditionalGeneration

Bases: Module, SupportsMultiModal, SupportsPP

Step-Audio2 Main Controller

Manages two-stage inference pipeline:

- Stage 1 (Thinker): Audio understanding and token generation
- Stage 2 (Token2Wav): Audio token to waveform synthesis

Usage

Stage 1: Thinker

model = StepAudio2ForConditionalGeneration(
    vllm_config=config,
    model_stage="thinker",
)

Stage 2: Token2Wav

model = StepAudio2ForConditionalGeneration(
    vllm_config=config,
    model_stage="token2wav",
)

config instance-attribute

config = config

have_multimodal_outputs instance-attribute

have_multimodal_outputs = True

make_empty_intermediate_tensors instance-attribute

make_empty_intermediate_tensors = (
    make_empty_intermediate_tensors
    if model_stage == "thinker"
    else (lambda: None)
)
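
The stage-dependent selection above can be sketched in isolation. This is a minimal illustration rather than the module's code: `pick_empty_tensor_factory` and `fake_thinker_factory` are hypothetical names, and the factory's argument and return value are placeholders chosen for the example.

```python
from typing import Any, Callable


def fake_thinker_factory(batch_size: int) -> dict[str, list[float]]:
    # Hypothetical stand-in for the thinker's own factory.
    return {"hidden_states": [0.0] * batch_size}


def pick_empty_tensor_factory(
    model_stage: str,
    thinker_factory: Callable[..., Any],
) -> Callable[..., Any]:
    # Mirrors the conditional above: the thinker stage keeps the real
    # factory, any other stage gets a zero-argument callable returning None.
    if model_stage == "thinker":
        return thinker_factory
    return lambda: None


thinker_fn = pick_empty_tensor_factory("thinker", fake_thinker_factory)
other_fn = pick_empty_tensor_factory("token2wav", fake_thinker_factory)
```

Note that the fallback takes no arguments, so callers in a non-thinker stage would have to invoke it without any.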

model instance-attribute

model = thinker

model_stage instance-attribute

model_stage = (
    "thinker"
    if raw_model_stage in ("thinker", "step_audio2_thinker")
    else raw_model_stage
)
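
The normalization shown above can be read as a small pure function. The sketch below assumes nothing beyond the expression itself; `normalize_model_stage` is a hypothetical helper name, not part of the module.

```python
def normalize_model_stage(raw_model_stage: str) -> str:
    # Both spellings of the thinker stage collapse to "thinker";
    # every other value is passed through unchanged.
    if raw_model_stage in ("thinker", "step_audio2_thinker"):
        return "thinker"
    return raw_model_stage


stages = [
    normalize_model_stage(name)
    for name in ("thinker", "step_audio2_thinker", "token2wav")
]
```

Because unknown values pass through untouched, any validation of the stage name would have to happen elsewhere.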

multimodal_config instance-attribute

multimodal_config = multimodal_config

thinker instance-attribute

thinker = init_vllm_registered_model(
    vllm_config=vllm_config,
    prefix=maybe_prefix(prefix, "thinker"),
    hf_config=config,
    architectures=[
        "StepAudio2ThinkerForConditionalGeneration"
    ],
)

token2wav instance-attribute

token2wav = None

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(hidden_states: Tensor) -> Tensor | None

Compute logits from hidden states

embed_input_ids

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings=None,
    is_multimodal=None,
) -> Tensor

Explicit vLLM embedding hook for both stages.

embed_multimodal

embed_multimodal(**kwargs)

Delegate multimodal embedding to thinker stage only.
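
One way to picture this delegation is a thin wrapper that forwards keyword arguments to the thinker and does nothing in any other stage. The sketch below is an assumption-laden illustration: `MultimodalDelegate` and `thinker_embed` are hypothetical names, and returning an empty list outside the thinker stage is a guess for the example, not documented behavior.

```python
from typing import Any, Callable, Optional


class MultimodalDelegate:
    """Hypothetical wrapper that forwards embedding calls to the thinker."""

    def __init__(
        self,
        model_stage: str,
        thinker_embed: Optional[Callable[..., Any]] = None,
    ) -> None:
        self.model_stage = model_stage
        self.thinker_embed = thinker_embed

    def embed_multimodal(self, **kwargs: Any) -> Any:
        # Assumption: non-thinker stages have no multimodal inputs to
        # embed, so an empty list is returned instead of delegating.
        if self.model_stage != "thinker" or self.thinker_embed is None:
            return []
        return self.thinker_embed(**kwargs)


def fake_embed(**kwargs: Any) -> list[str]:
    # Stand-in embedder: reports which keyword arguments it received.
    return sorted(kwargs)


thinker_side = MultimodalDelegate("thinker", fake_embed)
token2wav_side = MultimodalDelegate("token2wav", fake_embed)
```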

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: IntermediateTensors | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs,
)

Forward pass through the model

For Thinker: Returns hidden states/logits

For Token2Wav: Returns waveform
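
The stage-dependent return value suggests a dispatch on `model_stage`. The following is a rough sketch of that pattern with plain Python callables standing in for the two submodules; `StageRouter`, the stub functions, and the arguments passed to the token2wav stage are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional


class StageRouter:
    """Hypothetical controller that routes forward() by stage name."""

    def __init__(
        self,
        model_stage: str,
        thinker: Optional[Callable[..., Any]] = None,
        token2wav: Optional[Callable[..., Any]] = None,
    ) -> None:
        self.model_stage = model_stage
        self.thinker = thinker
        self.token2wav = token2wav

    def forward(self, input_ids: list[int], positions: list[int], **kwargs: Any) -> Any:
        if self.model_stage == "thinker" and self.thinker is not None:
            # Stage 1: the result stands in for hidden states/logits.
            return self.thinker(input_ids, positions, **kwargs)
        if self.model_stage == "token2wav" and self.token2wav is not None:
            # Stage 2: the result stands in for a waveform.
            return self.token2wav(input_ids, **kwargs)
        raise ValueError(f"no submodule available for stage {self.model_stage!r}")


def stub_thinker(input_ids: list[int], positions: list[int], **kwargs: Any) -> dict:
    return {"hidden_states": [float(i) for i in input_ids]}


def stub_token2wav(input_ids: list[int], **kwargs: Any) -> dict:
    return {"waveform": [0.0] * len(input_ids)}


stage1 = StageRouter("thinker", thinker=stub_thinker)
stage2 = StageRouter("token2wav", token2wav=stub_token2wav)
```

Raising on an unrecognized stage is a design choice of this sketch; it keeps a misconfigured `model_stage` from silently returning nothing.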

get_input_embeddings

get_input_embeddings(input_ids: Tensor) -> Tensor

Compatibility helper used by older call sites.

get_language_model

get_language_model() -> Module

Get the underlying language model.

get_multimodal_embeddings

get_multimodal_embeddings(**kwargs)

Get multimodal embeddings - only used in Thinker stage.

get_placeholder_str classmethod

get_placeholder_str(modality: str, i: int) -> str | None

Get placeholder string for a modality

Returns:

Type Description
str | None

For audio: "" (matches processor's audio_token)

load_weights

load_weights(weights)

Load weights

move_submodules_to_devices

move_submodules_to_devices(
    *,
    thinker_device: str | device | None = None,
    token2wav_device: str | device | None = None,
) -> None

Optionally move thinker/token2wav to different devices

Example

model.move_submodules_to_devices(
    thinker_device='cuda:0',
    token2wav_device='cuda:1',
)
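
Since both device arguments are optional keyword-only parameters, a plausible reading is that each submodule is moved only when it exists and a device was given for it. The sketch below illustrates that reading with a duck-typed stand-in instead of real PyTorch modules; `FakeModule` and `move_submodules` are hypothetical, and the skip-when-`None` behavior is an assumption.

```python
from typing import Optional


class FakeModule:
    """Stand-in exposing a .to(device) method like torch.nn.Module."""

    def __init__(self) -> None:
        self.device = "cpu"

    def to(self, device: str) -> "FakeModule":
        self.device = device
        return self


def move_submodules(
    thinker: Optional[FakeModule],
    token2wav: Optional[FakeModule],
    *,
    thinker_device: Optional[str] = None,
    token2wav_device: Optional[str] = None,
) -> None:
    # Assumption: a submodule is left where it is unless it has been
    # constructed and a target device was supplied for it.
    if thinker is not None and thinker_device is not None:
        thinker.to(thinker_device)
    if token2wav is not None and token2wav_device is not None:
        token2wav.to(token2wav_device)


thinker_mod = FakeModule()
token2wav_mod = FakeModule()
move_submodules(thinker_mod, token2wav_mod, thinker_device="cuda:0")
```

In this sketch only the thinker moves, because no `token2wav_device` was passed.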