vllm_omni.model_executor.models.higgs_audio_v3.higgs_audio_v3_talker

Stage-0 talker for higgs-audio v3 (Qwen3 backbone, fused multi-codebook).

Architecture:

- Backbone: Qwen3 (~4B, 36 layers, 2560 hidden, GQA 32/8). No DualFFN.
- Fused multi-codebook embedding: [N*V, D] weight, offset lookup, sum across N
- Fused multi-codebook head: same weight (tied), reshape to [L, N, V]
- MusicGen-style delay pattern [0,1,...,7] with BOC/EOC
- Audio feedback: replace continuation-token embedding with fused codebook embed

Weight loading maps from the HF checkpoint's prefixes:

- tied.embedding.text_embedding. -> model.embed_tokens.
- body.layers. -> model.layers.
- body.norm. -> model.norm.
- tied.head.text_head. -> lm_head.
- tied.embedding.modality_embeddings.0.embedding. -> multimodal_embedding.
- tied.embedding.modality_embeddings.0.model. -> skipped (codec for code2wav)
- tied.head.modality_heads.0. -> skipped when tied
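The prefix table above can be read as a first-match rename. The sketch below is illustrative only and is not the module's load_weights: PREFIX_MAP and remap_name are hypothetical names, and only the tied case is covered, since the destination of the modality head weights in the untied case is not documented here.

```python
# Prefix table copied from the docstring above; None marks a skipped prefix.
# PREFIX_MAP and remap_name are hypothetical names used for illustration.
PREFIX_MAP = [
    ("tied.embedding.text_embedding.", "model.embed_tokens."),
    ("body.layers.", "model.layers."),
    ("body.norm.", "model.norm."),
    ("tied.head.text_head.", "lm_head."),
    ("tied.embedding.modality_embeddings.0.embedding.", "multimodal_embedding."),
    ("tied.embedding.modality_embeddings.0.model.", None),  # codec for code2wav
    ("tied.head.modality_heads.0.", None),  # skipped when tied (assumed here)
]


def remap_name(name):
    """Return the renamed parameter, or None if the weight is skipped."""
    for src, dst in PREFIX_MAP:
        if name.startswith(src):
            return None if dst is None else dst + name[len(src):]
    # Names that match no prefix are passed through unchanged (an assumption).
    return name
```

For example, remap_name("body.norm.weight") gives "model.norm.weight", while any name under tied.embedding.modality_embeddings.0.model. gives None.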

BOC_ID module-attribute

BOC_ID = 1024

EOC_ID module-attribute

EOC_ID = 1025

logger module-attribute

logger = init_logger(__name__)

HiggsAudioV3TalkerForConditionalGeneration

Bases: Module

Stage-0 talker for higgs-audio v3.

Wraps Qwen3Model backbone + fused multi-codebook modules for TTS generation with MusicGen-style delay pattern sampling and audio feedback embedding.

codebook_size instance-attribute

codebook_size = int(codebook_size)

config instance-attribute

config = hf_config

deferred_prefix_cache_mm_keys instance-attribute

deferred_prefix_cache_mm_keys = {'codes.audio'}

has_postprocess class-attribute instance-attribute

has_postprocess: bool = True

have_multimodal_outputs class-attribute instance-attribute

have_multimodal_outputs: bool = True

lm_head instance-attribute

lm_head = embed_tokens

logits_processor instance-attribute

logits_processor = LogitsProcessor(vocab_size)

modality_head instance-attribute

modality_head = HiggsFusedMultiTextHead(
    num_codebooks, codebook_size, hidden_size
)

model instance-attribute

model = Qwen3Model(
    vllm_config=backbone_vllm_config,
    prefix=f"{prefix}.model" if prefix else "model",
)

multimodal_embedding instance-attribute

multimodal_embedding = HiggsFusedMultiTextEmbedding(
    num_codebooks, codebook_size, hidden_size
)

num_codebooks instance-attribute

num_codebooks = int(num_codebooks)

postprocess_uses_hidden_states class-attribute instance-attribute

postprocess_uses_hidden_states: bool = False

postprocess_uses_multimodal_outputs class-attribute instance-attribute

postprocess_uses_multimodal_outputs: bool = False

postprocess_uses_req_infos class-attribute instance-attribute

postprocess_uses_req_infos: bool = False

prefer_model_sampler class-attribute instance-attribute

prefer_model_sampler: bool = True

requires_full_prefix_cached_hidden_states instance-attribute

requires_full_prefix_cached_hidden_states = False

skips_model_sampler_output_token_history class-attribute instance-attribute

skips_model_sampler_output_token_history: bool = True

supports_omni_decode_step_metadata instance-attribute

supports_omni_decode_step_metadata = True

supports_omni_query_start_loc class-attribute instance-attribute

supports_omni_query_start_loc: bool = True

tie_modality instance-attribute

tie_modality = tie_modality_embeddings

vllm_config instance-attribute

vllm_config = vllm_config

compute_logits

compute_logits(
    hidden_states: Tensor, sampling_metadata: Any = None
) -> Tensor

embed_input_ids

embed_input_ids(input_ids: Tensor, **_: Any) -> Tensor

forward

forward(
    input_ids: Tensor,
    positions: Tensor,
    intermediate_tensors: Any | None = None,
    inputs_embeds: Tensor | None = None,
    **kwargs: Any,
) -> Tensor

load_weights

load_weights(
    weights: Iterable[tuple[str, Tensor]],
) -> set[str]

make_omni_output

make_omni_output(
    model_outputs: Tensor | OmniOutput, **kwargs: Any
) -> OmniOutput

postprocess

postprocess(
    hidden_states_slice: Tensor,
    multimodal_outputs: Any = None,
    **req_infos: Any,
) -> dict[str, Any]

Publish per-request audio codes into model_intermediate_buffer.

Called once per request in batch order. Indexes _last_audio_codes by a running cursor (one row per request per step).

sample

sample(logits: Tensor, sampling_metadata: Any) -> Any

Model-owned sampler with delay-pattern audio dispatch.

Mirrors v2's pattern: bias LM logits to force audio continuation, sample multi-codebook codes via the fused head, apply delay pattern, and accumulate per-request state.
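For readers unfamiliar with the delay pattern, here is a minimal pure-Python sketch of a MusicGen-style layout in which codebook k is shifted right by k steps. The helper names apply_delay and revert_delay are hypothetical, and padding the leading slots with BOC_ID and the trailing slots with EOC_ID is an assumption about how the two IDs are used; the sampler itself builds the pattern incrementally per decode step rather than on a finished sequence.

```python
BOC_ID = 1024  # module constants documented above
EOC_ID = 1025


def apply_delay(codes):
    """Shift codebook k right by k steps.

    codes: N rows of T ints (one row per codebook).
    Returns N rows of T + N - 1 ints, padded with BOC_ID before the
    first real code and EOC_ID after the last one.
    """
    n = len(codes)
    return [
        [BOC_ID] * k + list(row) + [EOC_ID] * (n - 1 - k)
        for k, row in enumerate(codes)
    ]


def revert_delay(delayed):
    """Undo apply_delay and recover the aligned N x T codes."""
    n = len(delayed)
    t = len(delayed[0]) - (n - 1)
    return [row[k:k + t] for k, row in enumerate(delayed)]
```

With three codebooks and three frames, the first row is emitted immediately, the second starts one step later, and the third two steps later; revert_delay realigns them before the codes are handed to a decoder.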

update_decode_step_metadata

update_decode_step_metadata(
    *,
    input_ids: Tensor | None = None,
    positions: Tensor | None = None,
    inputs_embeds: Tensor | None = None,
    omni_query_start_loc: Tensor | None = None,
    **_: Any,
) -> None

Update per-step metadata before runner forward or CUDA graph replay.

HiggsFusedMultiTextEmbedding

Bases: Module

Fused multi-codebook embedding: [N*V, D] weight + offset lookup.

num_codebooks instance-attribute

num_codebooks = num_codebooks

vocab_size instance-attribute

vocab_size = vocab_size

weight instance-attribute

weight = Parameter(
    empty(num_codebooks * vocab_size, hidden_size)
)

forward

forward(codes: Tensor) -> Tensor
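The "[N*V, D] weight + offset lookup" description can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the class's actual code: the name FusedCodebookEmbeddingSketch is hypothetical, codes are assumed to be an integer tensor of shape [L, N], and the weight is randomly initialized here instead of being loaded from a checkpoint.

```python
import torch
from torch import nn


class FusedCodebookEmbeddingSketch(nn.Module):
    """Hypothetical stand-in for a fused multi-codebook embedding."""

    def __init__(self, num_codebooks, vocab_size, hidden_size):
        super().__init__()
        self.num_codebooks = num_codebooks
        self.vocab_size = vocab_size
        # One [N*V, D] table: rows k*V .. (k+1)*V - 1 belong to codebook k.
        self.weight = nn.Parameter(
            torch.randn(num_codebooks * vocab_size, hidden_size)
        )

    def forward(self, codes):
        # codes: [L, N] integer code IDs, one column per codebook (assumed layout).
        offsets = torch.arange(self.num_codebooks, device=codes.device)
        flat_ids = codes + offsets * self.vocab_size  # [L, N] rows in the fused table
        embedded = nn.functional.embedding(flat_ids, self.weight)  # [L, N, D]
        return embedded.sum(dim=1)  # sum across the N codebooks -> [L, D]
```

A single table with per-codebook offsets replaces N separate embedding lookups with one gather, which is the usual motivation for fusing.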

HiggsFusedMultiTextHead

Bases: Module

Fused multi-codebook head: [L, D] -> [L, N, V] via one linear.

num_codebooks instance-attribute

num_codebooks = num_codebooks

vocab_size instance-attribute

vocab_size = vocab_size

weight instance-attribute

weight = Parameter(
    empty(num_codebooks * vocab_size, hidden_size)
)

generate

generate(hidden: Tensor) -> Tensor
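The "[L, D] -> [L, N, V] via one linear" description can be sketched as a single matrix multiply followed by a reshape. As before, this is an illustration under assumptions, not the class's actual code: FusedCodebookHeadSketch is a hypothetical name, no bias is assumed, and the weight is randomly initialized.

```python
import torch
from torch import nn


class FusedCodebookHeadSketch(nn.Module):
    """Hypothetical stand-in for a fused multi-codebook output head."""

    def __init__(self, num_codebooks, vocab_size, hidden_size):
        super().__init__()
        self.num_codebooks = num_codebooks
        self.vocab_size = vocab_size
        # Same [N*V, D] layout as the fused embedding, so the two can share a weight.
        self.weight = nn.Parameter(
            torch.randn(num_codebooks * vocab_size, hidden_size)
        )

    def generate(self, hidden):
        # hidden: [L, D] -> logits over every (codebook, code) pair.
        logits = nn.functional.linear(hidden, self.weight)  # [L, N*V]
        return logits.reshape(
            hidden.shape[0], self.num_codebooks, self.vocab_size
        )  # [L, N, V]
```

Because the rows are grouped by codebook, logits[l, k, v] is the dot product of hidden[l] with row k*V + v of the weight, so tying the head to the embedding table requires no transposition or reordering.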