Prefill-Decode (PD) Disaggregation
PD disaggregation splits the Qwen3-Omni thinker into separate prefill and decode stages so prompt processing and token generation can run on different workers.
This is documented as a stage-config recipe instead of a bundled YAML because the deployment-specific values usually change per environment:
- GPU placement
- tensor_parallel_size
- connector backend and connector ports
- connector IPs or bootstrap addresses
Start from the default Qwen3-Omni stage config and copy it to your own file, for example qwen3_omni_pd.yaml. Then apply the changes below.
Requirements
- 3+ GPUs for a basic layout: prefill, decode, and talker+code2wav
- A KV connector supported by vLLM, such as MooncakeConnector
- Matching tensor_parallel_size on the prefill and decode thinker stages
1. Split the thinker into prefill and decode stages
Replace the original thinker stage with two stages:
stage_args:
  - stage_id: 0
    stage_type: llm
    is_prefill_only: true
    runtime:
      devices: "0"
    engine_args:
      max_num_seqs: 16
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      kv_transfer_config:
        kv_connector: "MooncakeConnector"
        kv_role: "kv_producer"
        kv_rank: 0
        kv_parallel_size: 2
        kv_connector_extra_config:
          mooncake_bootstrap_port: 25201
    final_output: false
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
  - stage_id: 1
    stage_type: llm
    is_decode_only: true
    runtime:
      devices: "1"
    engine_args:
      max_num_seqs: 64
      model_stage: thinker
      model_arch: Qwen3OmniMoeForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      gpu_memory_utilization: 0.9
      enforce_eager: true
      trust_remote_code: true
      engine_output_type: latent
      distributed_executor_backend: "mp"
      enable_prefix_caching: false
      max_num_batched_tokens: 32768
      hf_config_name: thinker_config
      tensor_parallel_size: 1
      kv_transfer_config:
        kv_connector: "MooncakeConnector"
        kv_role: "kv_consumer"
        kv_rank: 1
        kv_parallel_size: 2
        kv_connector_extra_config:
          mooncake_bootstrap_port: 25202
    engine_input_source: [0]
    final_output: true
    final_output_type: text
    is_comprehension: true
    default_sampling_params:
      temperature: 0.4
      top_p: 0.9
      top_k: 1
      max_tokens: 2048
      seed: 42
      detokenize: True
      repetition_penalty: 1.05
Notes:
- is_prefill_only: true marks the thinker stage that only saves KV.
- is_decode_only: true marks the thinker stage that resumes from remote KV.
- kv_transfer_config is required on both stages.
- The orchestrator forces the prefill stage to run with max_tokens=1, so the prefill side only processes the prompt and exports KV.
2. Shift the downstream stages by one index
After inserting the extra thinker stage, renumber the remaining stages:
  - stage_id: 2
    runtime:
      devices: "2"
    engine_input_source: [1]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker
  - stage_id: 3
    runtime:
      devices: "2"
    engine_args:
      max_num_seqs: 1
    engine_input_source: [2]
    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav
Compared with the default Qwen3-Omni config:
- the talker becomes stage 2 instead of stage 1
- the code2wav stage becomes stage 3 instead of stage 2
- the talker now reads from decode stage 1
3. Add runtime edges for the four-stage pipeline
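A minimal sketch of the edge list for the four-stage layout. It assumes that your copy of the default config declares stage-to-stage edges in a top-level runtime block with from/to keys. The key names and the window_size value below are assumptions, so mirror whatever your copied default config already uses and only extend the chain to cover stages 0 through 3:

```yaml
# Assumed layout: top-level `runtime.edges`, as in the copied default config.
# Verify the key names against your own file before using this.
runtime:
  edges:
    - from: 0          # prefill thinker -> decode thinker
      to: 1
      window_size: -1  # assumed: trigger downstream after upstream completes
    - from: 1          # decode thinker -> talker
      to: 2
      window_size: -1
    - from: 2          # talker -> code2wav
      to: 3
      window_size: -1
```

The chain should match the engine_input_source values used in the stage definitions: stage 1 reads from 0, stage 2 from 1, and stage 3 from 2.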
4. Launch with your custom config
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-configs-path /path/to/qwen3_omni_pd.yaml
Operational Notes
- MooncakeConnector does not support heterogeneous TP sizes across the PD pair. Keep prefill and decode at the same tensor_parallel_size.
- If the thinker requires TP=2, both thinker stages must use TP=2 and be given separate GPU sets, for example "0,1" for prefill and "2,3" for decode.
- Choose connector ports and addresses that match your deployment. The values shown above are examples only.
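The TP=2 case might translate into overrides like the following. This is a sketch that shows only the keys that change relative to the stage definitions in steps 1 and 2; all other keys stay as they are. Placing the talker and code2wav stages on GPU "4" is an assumption made here so that they do not share a device with the decode thinker; the recipe itself does not prescribe their placement for this layout.

```yaml
stage_args:
  - stage_id: 0                  # prefill thinker
    runtime:
      devices: "0,1"
    engine_args:
      tensor_parallel_size: 2    # must match the decode stage
  - stage_id: 1                  # decode thinker
    runtime:
      devices: "2,3"
    engine_args:
      tensor_parallel_size: 2    # must match the prefill stage
  - stage_id: 2                  # talker
    runtime:
      devices: "4"               # assumed placement, adjust to your hardware
  - stage_id: 3                  # code2wav
    runtime:
      devices: "4"               # assumed placement, shares a GPU with the talker
```

With this layout the deployment needs five GPUs instead of three.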