vllm_omni.model_executor.models.indextts2.pipeline

IndexTTS2 pipeline: GPT AR talker (text → mel codes) → S2Mel + BigVGAN (mel → audio).

A two-stage, non-streaming pipeline. S2Mel flow matching (25 Euler steps) needs the complete mel code sequence before it can run, so async_chunk is not applicable.
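To illustrate why a fixed-step Euler solver works on the whole sequence at once, here is a minimal sketch of 25-step Euler integration over a toy velocity field. It is not the S2Mel implementation: `euler_sample` and `toy_velocity` are hypothetical names, and the straight-line field stands in for the learned network that the real decoder would evaluate. Each step updates every frame of the sequence together, which is consistent with the stage needing all mel codes up front.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 25,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)  # evaluated on the full sequence at once
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned field: points straight at `target`."""

    def velocity(x: Vector, t: float) -> Vector:
        # Straight-line flow: the remaining displacement divided by remaining time.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


# Start from an all-zero "noise" vector and flow toward a toy target.
result = euler_sample(toy_velocity([1.0, -2.0, 0.5]), [0.0, 0.0, 0.0])
```

With this particular field the 25 steps land on the target up to floating-point error; a learned field would generally leave some discretization error that depends on the step count.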

INDEXTTS2_PIPELINE module-attribute

INDEXTTS2_PIPELINE = PipelineConfig(
    model_type="indextts2",
    model_arch="IndexTTS2TalkerForConditionalGeneration",
    stages=(
        StagePipelineConfig(
            stage_id=0,
            model_stage="indextts2_talker",
            execution_type=StageExecutionType.LLM_AR,
            input_sources=(),
            owns_tokenizer=True,
            engine_output_type="latent",
            extras={
                "skip_tokenizer_init": True,
                "tokenizer": "gpt2",
            },
            custom_process_next_stage_input_func=f"{_PROC}.talker2s2mel_full_payload",
            sampling_constraints={
                "detokenize": False,
                "stop_token_ids": [8193],
            },
        ),
        StagePipelineConfig(
            stage_id=1,
            model_stage="indextts2_s2mel_decoder",
            execution_type=StageExecutionType.LLM_GENERATION,
            input_sources=(0,),
            final_output=True,
            final_output_type="audio",
            engine_output_type="audio",
            model_arch="IndexTTS2S2MelDecoder",
            sync_process_input_func=f"{_PROC}.talker2s2mel_token_only",
            extras={
                "skip_tokenizer_init": True,
                "tokenizer": "gpt2",
            },
            sampling_constraints={"detokenize": True},
        ),
    ),
)
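The wiring in the config above (stage 0 has no inputs, stage 1 reads from stage 0 and produces the final audio) can be sanity-checked with a small standalone sketch. `StageSketch` and `check_wiring` are hypothetical helpers written for this illustration, not part of vllm_omni; they mirror only the `stage_id`, `model_stage`, `input_sources`, and `final_output` fields, and the rule that exactly one stage is final is an assumption made here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class StageSketch:
    """Hypothetical mirror of a few StagePipelineConfig fields."""

    stage_id: int
    model_stage: str
    input_sources: Tuple[int, ...] = ()
    final_output: bool = False


def check_wiring(stages: Iterable[StageSketch]) -> int:
    """Return the id of the final stage, or raise ValueError on bad wiring."""
    stages = tuple(stages)
    seen = set()
    for stage in stages:
        for src in stage.input_sources:
            # Each input must come from a stage declared earlier.
            if src not in seen:
                raise ValueError(
                    f"stage {stage.stage_id} reads from unknown stage {src}"
                )
        seen.add(stage.stage_id)
    finals = [s.stage_id for s in stages if s.final_output]
    if len(finals) != 1:  # assumption: exactly one final stage
        raise ValueError(f"expected one final stage, found {len(finals)}")
    return finals[0]


STAGES = (
    StageSketch(0, "indextts2_talker"),
    StageSketch(1, "indextts2_s2mel_decoder", input_sources=(0,), final_output=True),
)
final_stage_id = check_wiring(STAGES)
```

For the two stages shown, the check passes and identifies stage 1 as the one that emits audio.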