Skip to content

Prefill-Decode (PD) Disaggregation

PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode stages so prompt processing and token generation can run on different workers.

This is documented as a stage-config recipe instead of a bundled YAML because the deployment-specific values usually change per environment:

  • GPU placement
  • tensor_parallel_size
  • connector backend and connector ports
  • connector IPs or bootstrap addresses

Start from the default Qwen3-Omni stage config and copy it to your own file, for example qwen3_omni_pd.yaml. Then apply the changes below.

Requirements

  • 3+ GPUs for a basic layout: prefill, decode, and talker+code2wav
  • A KV connector supported by vLLM, such as MooncakeConnector
  • Matching tensor_parallel_size on the prefill and decode thinker stages

1. Split the thinker into prefill and decode stages

Replace the original thinker stage with two stages:

stage_args:
  - stage_id: 0
    stage_type: llm
    is_prefill_only: true
    runtime:
      devices: "0"
    engine_args:
      max_num_seqs: 16
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      kv_transfer_config:
        kv_connector: "MooncakeConnector"
        kv_role: "kv_producer"
        kv_rank: 0
        kv_parallel_size: 2
        kv_connector_extra_config:
          mooncake_bootstrap_port: 25201
    final_output: false
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05

  - stage_id: 1
    stage_type: llm
    is_decode_only: true
    runtime:
      devices: "1"
    engine_args:
      max_num_seqs: 64
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      kv_transfer_config:
        kv_connector: "MooncakeConnector"
        kv_role: "kv_consumer"
        kv_rank: 1
        kv_parallel_size: 2
        kv_connector_extra_config:
          mooncake_bootstrap_port: 25202
    engine_input_source: [0]
    final_output: true
    final_output_type: text
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05

Notes:

  • is_prefill_only: true marks the thinker stage that only saves KV.
  • is_decode_only: true marks the thinker stage that resumes from remote KV.
  • kv_transfer_config is required on both stages.
  • The orchestrator forces the prefill stage to run with max_tokens=1, so the prefill side only processes the prompt and exports KV.

2. Shift the downstream stages by one index

After inserting the extra thinker stage, renumber the remaining stages:

  - stage_id: 2
    runtime:
      devices: "2"
    engine_input_source: [1]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker

  - stage_id: 3
    runtime:
      devices: "2"
    engine_args:
      max_num_seqs: 1
    engine_input_source: [2]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav

Compared with the default Qwen3-Omni config:

  • the talker becomes stage 2 instead of stage 1
  • the code2wav stage becomes stage 3 instead of stage 2
  • the talker now reads from decode stage 1

3. Add runtime edges for the four-stage pipeline

runtime:
  enabled: true
  edges:
    - from: 0
      to: 1
    - from: 1
      to: 2
    - from: 2
      to: 3

4. Launch with your custom config

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
    --stage-configs-path /path/to/qwen3_omni_pd.yaml

Operational Notes

  • MooncakeConnector does not support heterogeneous TP sizes across the PD pair. Keep prefill and decode at the same tensor_parallel_size.
  • If the thinker requires TP=2, both thinker stages must use TP=2 and be given separate GPU sets, for example "0,1" for prefill and "2,3" for decode.
  • Choose connector ports and addresses that match your deployment. The values shown above are examples only.