vllm_omni.model_executor.models.higgs_audio_v3.higgs_audio_v3_tokenizer

Prompt builder for higgs-audio v3 TTS.

Prompt formats

Zero-shot:
    <|tts|> <|text|> {text tokens} <|audio|>
Voice clone (no ref text):
    <|tts|> <|ref_audio|> [-100]×N <|text|> {text tokens} <|audio|>
Voice clone (with ref text):
    <|tts|> <|ref_text|> {ref text tokens} <|ref_audio|> [-100]×N <|text|> {text tokens} <|audio|>

The -100 placeholders (AUDIO_PLACEHOLDER_ID) are replaced at prefill time with fused multi-codebook embeddings of the delay-pattern-encoded reference audio codes.

AUDIO_PLACEHOLDER_ID module-attribute

AUDIO_PLACEHOLDER_ID = -100

BOC_ID module-attribute

BOC_ID = 1024

EOC_ID module-attribute

EOC_ID = 1025

HiggsAudioV3TokenizerAdapter

Wraps the HF tokenizer and builds TTS prompts.

audio_id instance-attribute

audio_id: int = vocab['<|audio|>']

ref_audio_id instance-attribute

ref_audio_id: int | None = get('<|ref_audio|>')

ref_text_id instance-attribute

ref_text_id: int | None = get('<|ref_text|>')

text_id instance-attribute

text_id: int = vocab['<|text|>']

tokenizer property

tokenizer: Any

tts_id instance-attribute

tts_id: int = vocab['<|tts|>']

build_prompt

build_prompt(
    text: str,
    *,
    num_ref_tokens: int = 0,
    reference_text: str | None = None,
) -> list[int]

Build a TTS prompt.

Parameters:

Name | Type | Description | Default
text | str | Target text to synthesize. | required
num_ref_tokens | int | Number of delay-pattern reference code rows. 0 means zero-shot (no voice clone). | 0
reference_text | str | None | Optional transcript of the reference audio. | None
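
The three prompt formats listed under "Prompt formats" can be assembled with a few lines of list manipulation. The sketch below is illustrative only and is not the adapter's actual code: `build_prompt_sketch`, the `encode` callable, and the `special` ID mapping are hypothetical stand-ins for the wrapped HF tokenizer and its vocabulary. It also assumes that `reference_text` is ignored when `num_ref_tokens` is 0, which the documentation above does not state.

```python
from typing import Callable, Optional

AUDIO_PLACEHOLDER_ID = -100  # value documented for this module


def build_prompt_sketch(
    text: str,
    encode: Callable[[str], list],  # hypothetical: maps text to token IDs
    special: dict,                  # hypothetical: special-token string -> ID
    *,
    num_ref_tokens: int = 0,
    reference_text: Optional[str] = None,
) -> list:
    """Assemble token IDs following the documented prompt formats."""
    ids = [special["<|tts|>"]]
    if num_ref_tokens > 0:
        # Voice clone: optional transcript first, then audio placeholders.
        if reference_text is not None:
            ids.append(special["<|ref_text|>"])
            ids.extend(encode(reference_text))
        ids.append(special["<|ref_audio|>"])
        ids.extend([AUDIO_PLACEHOLDER_ID] * num_ref_tokens)
    # Assumption: in zero-shot mode any reference_text is simply ignored.
    ids.append(special["<|text|>"])
    ids.extend(encode(text))
    ids.append(special["<|audio|>"])
    return ids


# Toy stand-ins so the sketch runs without a real tokenizer.
toy_special = {
    "<|tts|>": 1, "<|ref_text|>": 2, "<|ref_audio|>": 3,
    "<|text|>": 4, "<|audio|>": 5,
}
toy_encode = lambda s: [ord(ch) for ch in s]

zero_shot = build_prompt_sketch("hi", toy_encode, toy_special)
cloned = build_prompt_sketch(
    "hi", toy_encode, toy_special, num_ref_tokens=2, reference_text="a"
)
```

With the toy IDs above, the zero-shot prompt contains no placeholders, while the cloned prompt carries two -100 entries between `<|ref_audio|>` and `<|text|>`.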

apply_delay_pattern

apply_delay_pattern(codes_tn: Tensor) -> Tensor

Apply MusicGen-style delay pattern to raw codes.

Input: [T, N] raw codes (T frames, N codebooks). Output: [T + N - 1, N] delayed codes with BOC/EOC padding.

Codebook c is delayed by c positions: rows 0..c-1 get BOC, rows c..c+T-1 get real codes, rows c+T..T+N-2 get EOC.
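
The row layout described above can be checked on a small example. The following is a minimal sketch that uses plain nested lists instead of a `Tensor` so that it runs without any dependencies; `apply_delay_pattern_sketch` is a hypothetical name, and the real function may be implemented differently.

```python
BOC_ID = 1024  # values documented for this module
EOC_ID = 1025


def apply_delay_pattern_sketch(codes_tn):
    """Delay codebook c by c rows: [T, N] in, [T + N - 1, N] out."""
    num_frames = len(codes_tn)
    if num_frames == 0:
        return []
    num_codebooks = len(codes_tn[0])
    num_rows = num_frames + num_codebooks - 1
    # Start with EOC everywhere, then overwrite the BOC and real-code rows.
    delayed = [[EOC_ID] * num_codebooks for _ in range(num_rows)]
    for c in range(num_codebooks):
        for r in range(c):               # rows 0..c-1 get BOC
            delayed[r][c] = BOC_ID
        for t in range(num_frames):      # rows c..c+T-1 get real codes
            delayed[c + t][c] = codes_tn[t][c]
    return delayed


# T = 2 frames, N = 3 codebooks -> 4 delayed rows.
raw = [[10, 20, 30],
       [11, 21, 31]]
delayed = apply_delay_pattern_sketch(raw)
# delayed == [[10,   1024, 1024],
#             [11,   20,   1024],
#             [1025, 21,   30],
#             [1025, 1025, 31]]
```

Since `num_ref_tokens` in `build_prompt` is described as the number of delay-pattern reference code rows, the row count of the delayed output (T + N - 1, here 4) is presumably the value passed for that parameter.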

encode_reference_audio

encode_reference_audio(wav: ndarray, sr: int) -> Tensor

Encode a reference audio clip to codec codes [T, num_codebooks].

Uses the same HiggsAudioV2TokenizerModel as v2 (same codec). Returns raw codes before delay pattern application.