
vllm_omni.model_executor.models.fish_speech.pipeline

Fish Speech S2 Pro pipeline topology (frozen).

Stage 0: slow_ar, text → RVQ codec tokens (LLM_AR, autoregressive).
Stage 1: dac_decoder, RVQ tokens → audio waveform (LLM_GENERATION).

The top-level HF config reports model_type = "fish_qwen3_omni" (the OmniConfig that bundles the slow-AR and fast-AR sub-configs), so the registry key follows the HF name rather than the human-readable "fish_speech".
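The lookup consequence of that naming can be sketched as follows. This is a minimal illustration that assumes a plain dict registry keyed by model_type. `PIPELINE_REGISTRY` and `resolve_pipeline` are hypothetical names and are not part of vllm_omni.

```python
# Hypothetical registry: maps the HF config's top-level model_type to the
# name of a pipeline definition. Not the actual vllm_omni registry.
PIPELINE_REGISTRY = {"fish_qwen3_omni": "FISH_SPEECH_PIPELINE"}


def resolve_pipeline(hf_config: dict) -> str:
    """Return the pipeline name registered for the config's model_type."""
    model_type = hf_config["model_type"]
    try:
        return PIPELINE_REGISTRY[model_type]
    except KeyError:
        raise KeyError(
            f"no pipeline registered for model_type={model_type!r}"
        ) from None


# The HF name resolves; the human-readable name would not.
print(resolve_pipeline({"model_type": "fish_qwen3_omni"}))
```

Under this assumption, a config that reported "fish_speech" would raise KeyError, which is why the key has to match what the HF config actually reports.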

FISH_SPEECH_PIPELINE module-attribute

FISH_SPEECH_PIPELINE = PipelineConfig(
    model_type="fish_qwen3_omni",
    model_arch="FishSpeechSlowARForConditionalGeneration",
    stages=(
        StagePipelineConfig(
            stage_id=0,
            model_stage="fish_speech_slow_ar",
            execution_type=LLM_AR,
            input_sources=(),
            owns_tokenizer=True,
            engine_output_type="latent",
            async_chunk_process_next_stage_input_func=f"{_PROC}.slow_ar_to_dac_decoder_async_chunk",
            sampling_constraints={
                "detokenize": False,
                "stop_token_ids": [151645],
            },
        ),
        StagePipelineConfig(
            stage_id=1,
            model_stage="dac_decoder",
            model_arch="FishSpeechDACDecoder",
            execution_type=LLM_GENERATION,
            input_sources=(0,),
            final_output=True,
            final_output_type="audio",
            engine_output_type="audio",
            sampling_constraints={"detokenize": True},
        ),
    ),
)
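The topology above can be read as a small directed graph: each stage lists the stage IDs it consumes in input_sources, and exactly one stage is marked final_output. The sketch below checks those two properties on a stand-in structure. `Stage` and `validate_topology` are hypothetical helpers written for illustration; they only mirror a few field names from the config and are not the PipelineConfig or StagePipelineConfig classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in carrying only the topology-related fields."""
    stage_id: int
    model_stage: str
    input_sources: tuple[int, ...] = ()
    final_output: bool = False


def validate_topology(stages: tuple[Stage, ...]) -> str:
    """Check that every input comes from an earlier stage and that exactly
    one stage is final. Returns the model_stage of the final stage."""
    seen: set[int] = set()
    for stage in stages:
        if stage.stage_id in seen:
            raise ValueError(f"duplicate stage_id {stage.stage_id}")
        for src in stage.input_sources:
            if src not in seen:
                raise ValueError(
                    f"stage {stage.stage_id} reads from undefined stage {src}"
                )
        seen.add(stage.stage_id)
    finals = [s.model_stage for s in stages if s.final_output]
    if len(finals) != 1:
        raise ValueError(f"expected exactly one final stage, got {len(finals)}")
    return finals[0]


# Same shape as the pipeline above: stage 0 has no inputs, stage 1 reads stage 0.
STAGES = (
    Stage(stage_id=0, model_stage="fish_speech_slow_ar"),
    Stage(stage_id=1, model_stage="dac_decoder", input_sources=(0,), final_output=True),
)
print(validate_topology(STAGES))
```

Whether the real config classes perform a comparable check is not stated here; the sketch only makes the intended data flow explicit.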