Dynamic Chunked Pipeline Parallel#
Note
For design details and mathematical models, see the Design Document. For a deployment tutorial, see the Dynamic Chunked Pipeline Parallel Tutorial.
Overview#
Dynamic Chunked Pipeline Parallel (CPP) is a profiling-based dynamic chunking strategy that optimizes prefill performance for long sequences in Pipeline Parallelism (PP) scenarios.
When to Use#
Variable-length sequence serving: PP introduces no degradation on short sequences, while dynamic chunking improves performance on long sequences.
Ultra-long sequence inference: For sequences exceeding single-machine memory capacity (e.g., 1M tokens), dynamic chunking significantly reduces pipeline idle time.
Supported Scenarios#
CPP currently focuses on optimizing the prefill phase, so it is best used in PD disaggregation scenarios. Supported features are as follows:
| | Eager | Graph | Prefix | Chunked |
|---|---|---|---|---|
| CPP | ✅ | ✅ | ✅ | ✅ |
How to Enable#
Online Serving#
vllm serve <model_path> \
--pipeline-parallel-size 2 \
--enable-chunked-prefill \
--additional-config '{"profiling_chunk_config": {"enabled": true}}'
Offline Inference#
from vllm import LLM

llm = LLM(
    model="<model_path>",
    pipeline_parallel_size=2,
    additional_config={"profiling_chunk_config": {"enabled": True}},
)
Configuration Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Enable/disable Dynamic Chunked Pipeline Parallel |
| smooth_factor | float | 1.0 | Smoothing factor (0 < x ≤ 1.0). Higher values trust dynamic prediction more |
| min_chunk | int | 4096 | Minimum chunk size for dynamic calculation |
| | bool | True | Enable/disable Online Calibration |
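As a rough illustration of how these parameters might be assembled, the sketch below builds an `additional_config` dictionary and applies the documented range checks. The helper name `build_profiling_chunk_config` is hypothetical, and placing `smooth_factor` and `min_chunk` (names taken from the Parameter Tuning section) next to `enabled` inside `profiling_chunk_config` is an assumption, not something this page confirms.

```python
def build_profiling_chunk_config(enabled=True, smooth_factor=1.0, min_chunk=4096,
                                 max_num_batched_tokens=None):
    """Hypothetical helper: build additional_config with basic sanity checks."""
    # Documented range for the smoothing factor: 0 < x <= 1.0.
    if not 0.0 < smooth_factor <= 1.0:
        raise ValueError("smooth_factor must satisfy 0 < x <= 1.0")
    if min_chunk <= 0:
        raise ValueError("min_chunk must be positive")
    # The tuning notes say min_chunk should be smaller than max-num-batched-tokens.
    if max_num_batched_tokens is not None and min_chunk >= max_num_batched_tokens:
        raise ValueError("min_chunk should be smaller than max-num-batched-tokens")
    # Assumption: all keys live under "profiling_chunk_config", like "enabled".
    return {
        "profiling_chunk_config": {
            "enabled": enabled,
            "smooth_factor": smooth_factor,
            "min_chunk": min_chunk,
        }
    }


config = build_profiling_chunk_config(smooth_factor=0.8, max_num_batched_tokens=32768)
```

The resulting dictionary could then be passed as `additional_config=config` to `LLM(...)`, provided the key layout assumed here matches the actual implementation.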
Parameter Tuning#
smooth_factor: Controls trust level in dynamic prediction
  1.0: Strictly follow model prediction
  0.6~0.85: Balance dynamic adjustment and scheduling overhead
  0.0: No dynamic adjustment (degrades to fixed chunking)
min_chunk: Generally doesn’t need adjustment. Should be smaller than max-num-batched-tokens
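One way to picture the effect of smooth_factor is as a blend between the fixed chunk size and the predicted one. The sketch below uses plain linear interpolation, which is only an assumption chosen because it reproduces the two documented endpoints (1.0 follows the prediction, 0.0 falls back to fixed chunking); the actual formula is described in the Design Document and may differ. The function name `smoothed_chunk` and its arguments are hypothetical.

```python
def smoothed_chunk(predicted, fixed, smooth_factor, min_chunk=4096):
    """Illustrative only: blend a predicted chunk size with the fixed one."""
    # Assumption: linear interpolation between the fixed and predicted sizes.
    blended = fixed + smooth_factor * (predicted - fixed)
    # Never go below the configured minimum chunk size.
    return max(min_chunk, int(round(blended)))


for factor in (1.0, 0.8, 0.0):
    print(factor, smoothed_chunk(predicted=16384, fixed=32768, smooth_factor=factor))
```

Under this assumption, values between 0.6 and 0.85 move the chunk size most of the way toward the prediction while damping large swings between consecutive chunks.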
Recommended Settings#
max-num-batched-tokens#
Notably, the TTFT of CPP is very sensitive to max-num-batched-tokens, which serves as the initial chunk size for dynamic solving. If it is too large, it introduces significant computational voids; if it is too small, operator efficiency drops. To leave enough room for dynamic adjustment, set max-num-batched-tokens larger as the sequences being processed get longer. Recommended values:
| Sequence Length | max-num-batched-tokens |
|---|---|
| 64k | 20480 |
| 128k | 32768 |
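The recommended values can be captured in a small lookup, sketched below. Two assumptions are made here: "64k" and "128k" are read as 65536 and 131072 tokens, and a sequence length that falls between table entries is rounded up to the next tabulated length. The function name is hypothetical, and lengths beyond the table are rejected rather than extrapolated.

```python
# Recommended values from the table above (assuming 64k = 64 * 1024 tokens).
RECOMMENDED_TOKENS = {
    64 * 1024: 20480,
    128 * 1024: 32768,
}


def recommended_max_num_batched_tokens(seq_len):
    """Hypothetical helper: look up a starting max-num-batched-tokens value."""
    # Assumption: round up to the next tabulated sequence length.
    for length in sorted(RECOMMENDED_TOKENS):
        if seq_len <= length:
            return RECOMMENDED_TOKENS[length]
    raise ValueError("no recommendation tabulated for this sequence length")


print(recommended_max_num_batched_tokens(100_000))
```

Treat the returned value as a starting point for tuning, since TTFT is sensitive to this setting.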
Online Calibration#
For optimal performance, run online calibration with real data before going to production:
You can use aisbench to generate fixed-length random datasets. Refer to Using AISBench for performance evaluation for details.
Modify
<YOUR_AISBENCH_PATH>/benchmark/ais_bench/datasets/synthetic/synthetic_config.py:
synthetic_config = {
    "Type": "string",
    "RequestCount": 5,
    "TrustRemoteCode": False,
    "StringConfig": {
        "Input": {
            "Method": "uniform",
            # Your max sequence length, max-model-len
            "Params": {"MinValue": 131072, "MaxValue": 131072}
        },
        "Output": {
            "Method": "uniform",
            "Params": {"MinValue": 1, "MaxValue": 1}
        }
    },
}
Run for online calibration:
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen --mode perf --debug
Set the online calibration data length to match your max-model-len. Use batch_size=1, and make sure the requests differ from one another to avoid cache hits if prefix caching is enabled.
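If the calibration length changes often, the synthetic dataset configuration can be generated rather than edited by hand. The sketch below reproduces the dictionary shown earlier with the input length as a parameter; the helper name `make_synthetic_config` is hypothetical, while the keys and values mirror the example in this section.

```python
def make_synthetic_config(max_model_len, request_count=5):
    """Hypothetical helper: fixed-length calibration dataset config for AISBench."""
    return {
        "Type": "string",
        "RequestCount": request_count,
        "TrustRemoteCode": False,
        "StringConfig": {
            # Fixed input length: MinValue == MaxValue == max-model-len.
            "Input": {
                "Method": "uniform",
                "Params": {"MinValue": max_model_len, "MaxValue": max_model_len},
            },
            # A single output token keeps the run focused on prefill.
            "Output": {
                "Method": "uniform",
                "Params": {"MinValue": 1, "MaxValue": 1},
            },
        },
    }


synthetic_config = make_synthetic_config(131072)
```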
Performance#
Refer to Using AISBench for performance evaluation for details.
To evaluate the effectiveness of Dynamic Chunked Pipeline Parallel in long-sequence LLM inference scenarios, we use DeepSeek-V3.1-W8A8 and Qwen3-235B and deploy the P (prefill) instance on Ascend Atlas A3 inference products*64G (A3). The configuration and performance data are as follows.
Fixed-length requests, concurrency=1:
DeepSeek-V3.1-W8A8:
| Configuration | CPP (Dynamic Chunk, chunksize=32k) | PP (Static Chunk, chunksize=32k) |
|---|---|---|
| Input length 128k | TTFT: 22.5s | TTFT: 27.0s |
Qwen3-235B:
| Configuration | CPP (Dynamic Chunk, chunksize=32k) | PP (Static Chunk, chunksize=32k) |
|---|---|---|
| Input length 256k | TTFT: 53.5s | TTFT: 61.4s |
Variable-length requests, concurrency=4:
DeepSeek-V3.1-W8A8:
| Configuration | 4k~64k Input, mean=32k, std=32k, prefix hit rate=99% |
|---|---|
| CPP2TP8 | Input throughput: 22424 tps/card |
| DP2TP8 | Input throughput: 16150 tps/card |
| PCP2TP8 | Input throughput: 18197 tps/card |
| TP16 | Input throughput: 18875 tps/card |
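To put the figures above in relative terms, the short script below computes the TTFT reduction of CPP against static-chunk PP and the throughput ratio of CPP2TP8 against the other layouts. It uses only the numbers reported in this section; the function names are illustrative.

```python
def ttft_reduction(baseline_s, cpp_s):
    """Fraction by which CPP lowers TTFT relative to the baseline."""
    return (baseline_s - cpp_s) / baseline_s


def throughput_gain(cpp_tps, other_tps):
    """Ratio of CPP input throughput to another configuration's."""
    return cpp_tps / other_tps


print(f"DeepSeek-V3.1-W8A8, 128k input: {ttft_reduction(27.0, 22.5):.1%} lower TTFT")
print(f"Qwen3-235B, 256k input: {ttft_reduction(61.4, 53.5):.1%} lower TTFT")

cpp2tp8 = 22424  # tps/card
for name, tps in {"DP2TP8": 16150, "PCP2TP8": 18197, "TP16": 18875}.items():
    print(f"CPP2TP8 vs {name}: {throughput_gain(cpp2tp8, tps):.2f}x input throughput")
```

With the reported numbers this works out to roughly 16.7% and 12.9% lower TTFT, and about 1.39x, 1.23x, and 1.19x the input throughput of DP2TP8, PCP2TP8, and TP16 respectively.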
Constraints#
Pipeline Parallelism Required: --pipeline-parallel-size > 1
Chunked Prefill Required: --enable-chunked-prefill
Incompatible with Balance Scheduling: Cannot enable VLLM_ASCEND_BALANCE_SCHEDULING
Startup Overhead: Profiling adds ~64 forward passes (tens of seconds)
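These constraints can be checked before launch with a small pre-flight function, sketched below. The function name `check_cpp_constraints` is hypothetical, and treating the values "1" or "true" as meaning that VLLM_ASCEND_BALANCE_SCHEDULING is enabled is an assumption about how that environment variable is interpreted.

```python
import os


def check_cpp_constraints(pipeline_parallel_size, enable_chunked_prefill, env=None):
    """Hypothetical pre-flight check; returns a list of constraint violations."""
    env = os.environ if env is None else env
    problems = []
    if pipeline_parallel_size <= 1:
        problems.append("--pipeline-parallel-size must be greater than 1")
    if not enable_chunked_prefill:
        problems.append("--enable-chunked-prefill is required")
    # Assumption: "1" or "true" (case-insensitive) means the feature is on.
    flag = env.get("VLLM_ASCEND_BALANCE_SCHEDULING", "0").strip().lower()
    if flag in {"1", "true"}:
        problems.append("VLLM_ASCEND_BALANCE_SCHEDULING must not be enabled")
    return problems


issues = check_cpp_constraints(
    pipeline_parallel_size=2,
    enable_chunked_prefill=True,
    env={"VLLM_ASCEND_BALANCE_SCHEDULING": "1"},
)
print(issues)
```

An empty list means none of the checked constraints is violated; the startup profiling overhead still applies in every case.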