vllm_omni.transformers_utils.configs.fish_speech

Fish Speech S2 Pro config registration with transformers AutoConfig.

Registers FishSpeechConfig (model_type="fish_qwen3_omni") and sub-configs so that AutoConfig.from_pretrained("fishaudio/s2-pro") returns the correct config class.
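The registration described above can be sketched with the public `AutoConfig.register` API from transformers. This is a minimal illustration, not the module's actual code: `DemoOmniConfig` and the `"demo_omni"` model type are hypothetical stand-ins for `FishSpeechConfig` and `"fish_qwen3_omni"`, chosen so the snippet does not collide with a real registration.

```python
from transformers import AutoConfig, PretrainedConfig


class DemoOmniConfig(PretrainedConfig):
    # Hypothetical stand-in for FishSpeechConfig; the name and fields
    # here are illustrative assumptions only.
    model_type = "demo_omni"

    def __init__(self, audio_pad_token_id=None, **kwargs):
        self.audio_pad_token_id = audio_pad_token_id
        super().__init__(**kwargs)


# Associate the model_type string with the config class.
# register() requires the class's model_type to match the key.
AutoConfig.register("demo_omni", DemoOmniConfig)

# AutoConfig can now resolve the model_type string to the class.
config = AutoConfig.for_model("demo_omni", audio_pad_token_id=7)
print(type(config).__name__, config.audio_pad_token_id)
```

Once a model type is registered this way, `AutoConfig.from_pretrained` can pick the class from the `model_type` field of a checkpoint's `config.json`, which is the behavior this module relies on for `"fishaudio/s2-pro"`.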

FishSpeechConfig

Bases: PretrainedConfig

Top-level config for Fish Speech S2 Pro (fish_qwen3_omni).

Wraps text_config (Slow AR) and audio_decoder_config (Fast AR).

audio_decoder_config instance-attribute

audio_decoder_config = (
    audio_decoder_config or FishSpeechFastARConfig()
)

audio_pad_token_id instance-attribute

audio_pad_token_id = audio_pad_token_id

model_type class-attribute instance-attribute

model_type = 'fish_qwen3_omni'

semantic_end_token_id instance-attribute

semantic_end_token_id = semantic_end_token_id

semantic_start_token_id instance-attribute

semantic_start_token_id = semantic_start_token_id

sub_configs class-attribute instance-attribute

sub_configs = {
    "text_config": FishSpeechSlowARConfig,
    "audio_decoder_config": FishSpeechFastARConfig,
}

text_config instance-attribute

text_config = text_config or FishSpeechSlowARConfig()

get_text_config

get_text_config(**kwargs) -> FishSpeechSlowARConfig
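The `x = x or Default()` assignments listed above mean that a missing sub-config falls back to a default-constructed one. The following dependency-free sketch shows that pattern. All class names are hypothetical stubs, the default values are placeholders, and the assumption that `get_text_config` simply returns `text_config` is inferred from its return annotation rather than confirmed by the source.

```python
class SlowARStub:
    """Hypothetical stand-in for FishSpeechSlowARConfig."""

    def __init__(self, hidden_size=1024):  # placeholder default
        self.hidden_size = hidden_size


class FastARStub:
    """Hypothetical stand-in for FishSpeechFastARConfig."""

    def __init__(self, num_codebooks=4):  # placeholder default
        self.num_codebooks = num_codebooks


class CompositeStub:
    """Hypothetical stand-in for FishSpeechConfig."""

    def __init__(self, text_config=None, audio_decoder_config=None,
                 audio_pad_token_id=None):
        # Fall back to a default sub-config when none is supplied.
        self.text_config = text_config or SlowARStub()
        self.audio_decoder_config = audio_decoder_config or FastARStub()
        self.audio_pad_token_id = audio_pad_token_id

    def get_text_config(self, **kwargs):
        # Assumption: the text config is the Slow AR sub-config.
        return self.text_config


default_cfg = CompositeStub()
custom_cfg = CompositeStub(text_config=SlowARStub(hidden_size=8))
print(default_cfg.get_text_config().hidden_size)
print(custom_cfg.get_text_config().hidden_size)
```

Because the fallback uses `or`, it applies to any falsy argument, not only to `None`.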

FishSpeechFastARConfig

Bases: PretrainedConfig

Fast AR (audio_decoder) config -- 4-layer residual codebook predictor.

attention_bias instance-attribute

attention_bias = False

attention_qk_norm instance-attribute

attention_qk_norm = attention_qk_norm

audio_hidden_dim instance-attribute

audio_hidden_dim = audio_hidden_dim

head_dim instance-attribute

head_dim = head_dim

hidden_act instance-attribute

hidden_act = 'silu'

hidden_size instance-attribute

hidden_size = dim

intermediate_size instance-attribute

intermediate_size = intermediate_size

max_position_embeddings instance-attribute

max_position_embeddings = max_seq_len

model_type class-attribute instance-attribute

model_type = 'fish_qwen3_audio_decoder'

num_attention_heads instance-attribute

num_attention_heads = n_head

num_codebooks instance-attribute

num_codebooks = num_codebooks

num_hidden_layers instance-attribute

num_hidden_layers = n_layer

num_key_value_heads instance-attribute

num_key_value_heads = n_local_heads

rms_norm_eps instance-attribute

rms_norm_eps = rms_norm_eps

rope_theta instance-attribute

rope_theta = rope_base

text_dim instance-attribute

text_dim = text_dim

FishSpeechSlowARConfig

Bases: PretrainedConfig

Slow AR (text_model) config -- Qwen3-based transformer.

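Maps Fish Speech field names (dim, n_layer, n_head, n_local_heads, max_seq_len, rope_base) to the Qwen3-compatible attribute names (hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, max_position_embeddings, rope_theta) so that vllm.model_executor.models.qwen3.Qwen3Model can consume this config without modification.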

attention_bias instance-attribute

attention_bias = False

attention_qk_norm instance-attribute

attention_qk_norm = attention_qk_norm

codebook_size instance-attribute

codebook_size = codebook_size

head_dim instance-attribute

head_dim = head_dim

hidden_act instance-attribute

hidden_act = 'silu'

hidden_size instance-attribute

hidden_size = dim

intermediate_size instance-attribute

intermediate_size = intermediate_size

max_position_embeddings instance-attribute

max_position_embeddings = max_seq_len

model_type class-attribute instance-attribute

model_type = 'fish_qwen3'

num_attention_heads instance-attribute

num_attention_heads = n_head

num_codebooks instance-attribute

num_codebooks = num_codebooks

num_hidden_layers instance-attribute

num_hidden_layers = n_layer

num_key_value_heads instance-attribute

num_key_value_heads = n_local_heads

rms_norm_eps instance-attribute

rms_norm_eps = rms_norm_eps

rope_theta instance-attribute

rope_theta = rope_base

scale_codebook_embeddings instance-attribute

scale_codebook_embeddings = scale_codebook_embeddings

semantic_begin_id instance-attribute

semantic_begin_id = semantic_begin_id

semantic_end_id instance-attribute

semantic_end_id = semantic_end_id
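The attribute listings above show Fish Speech constructor names (`dim`, `n_layer`, `n_head`, `n_local_heads`, `max_seq_len`, `rope_base`) being stored under Qwen3-style names. A minimal, dependency-free sketch of that renaming follows. `RenamingStub` is a hypothetical class and every default value is a placeholder, not a value taken from the real checkpoint.

```python
class RenamingStub:
    """Hypothetical stand-in showing the Fish Speech -> Qwen3 renaming."""

    def __init__(self, dim=1024, n_layer=2, n_head=8, n_local_heads=2,
                 max_seq_len=4096, rope_base=10000.0, **kwargs):
        # Fish Speech name on the right, Qwen3-style name on the left.
        self.hidden_size = dim
        self.num_hidden_layers = n_layer
        self.num_attention_heads = n_head
        self.num_key_value_heads = n_local_heads
        self.max_position_embeddings = max_seq_len
        self.rope_theta = rope_base
        # Fixed values, as in the listings above.
        self.attention_bias = False
        self.hidden_act = "silu"


# A config dict written with Fish Speech field names (illustrative values).
native_fields = {
    "dim": 64,
    "n_layer": 2,
    "n_head": 4,
    "n_local_heads": 2,
    "max_seq_len": 128,
    "rope_base": 10000.0,
}

cfg = RenamingStub(**native_fields)
print(cfg.hidden_size, cfg.num_key_value_heads, cfg.rope_theta)
```

Code that expects Qwen3-style attributes can then read `cfg.hidden_size` or `cfg.num_key_value_heads` without knowing the original field names.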