Dynamic Chunked Pipeline Parallel#
Note
For design details and mathematical models, see Design Document. For deployment tutorial, see Dynamic Chunked Pipeline Parallel Tutorial.
Overview#
Dynamic Chunked Pipeline Parallel (CPP) is a profiling-based dynamic chunking strategy that optimizes prefill performance for long sequences in Pipeline Parallelism (PP) scenarios.
When to Use#
Variable-length sequence serving: PP does not introduce degradation on short sequences, and gains benefits through dynamic chunks on long sequences.
Ultra-long sequence inference: For sequences exceeding single-machine memory capacity (e.g., 1M tokens), dynamic chunking significantly reduces pipeline idle time.
Supported Scenarios#
Currently CPP mainly focuses on optimization during the prefill phase. It is better to be used in PD disaggregation scenarios. Supported features are as follows:
Eager |
Graph |
Prefix |
Chunked |
|
|---|---|---|---|---|
CPP |
✅ |
✅ |
✅ |
✅ |
How to Enable#
Online Serving#
vllm serve <model_path> \
--pipeline-parallel-size 2 \
--enable-chunked-prefill \
--additional-config '{"profiling_chunk_config": {"enabled": true}}'
Offline Inference#
from vllm import LLM
llm = LLM(
model="<model_path>",
pipeline_parallel_size=2,
additional_config={"profiling_chunk_config": {"enabled": True}},
)
Configuration Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
bool |
False |
Enable/disable Dynamic Chunked Pipeline Parallel |
|
float |
1.0 |
Smoothing factor (0 < x ≤ 1.0). Higher values trust dynamic prediction more |
|
int |
4096 |
Minimum chunk size for dynamic calculation |
|
bool |
True |
Enable/disable Online Calibration |
|
int |
30 |
Number of chunk-time data for Online Calibration |
Parameter Tuning#
smooth_factor: Controls trust level in dynamic prediction1.0: Strictly follow model prediction0.6~0.85: Balance dynamic adjustment and scheduling overhead0.0: No dynamic adjustment (degrades to fixed chunking)
min_chunk: Generally doesn’t need adjustment. Should be smaller thanmax-num-batched-tokens
Recommended Settings#
max-num-batched-tokens#
Notably, the TTFT of CPP is very sensitive to max-num-batched-tokens (considered the initial chunksize for dynamic solving). Because if it is too large, it will introduce significant computational voids, and if it is too small, it will lead to a decrease in operator efficiency. To leave enough room for dynamic adjustments, we recommend that the longer the sequence being processed, the larger the max-num-batched-tokens should be set. Recommended values:
Sequence Length |
|
|---|---|
64k |
20480 |
128k |
32768 |
Online Calibration#
For optimal performance, online calibrate with real data before production:
You can use aisbench to generate fixed-length random datasets. Refer to Using AISBench for performance evaluation for details.
Modify
<YOUR_AISBENCH_PATH>/benchmark/ais_bench/datasets/synthetic/synthetic_config.py:synthetic_config = { "Type": "string", "RequestCount": 5, "TrustRemoteCode": False, "StringConfig": { "Input": { "Method": "uniform", "Params": {"MinValue": 131072, "MaxValue": 131072} # Your max sequence length, max-model-len }, "Output": { "Method": "uniform", "Params": {"MinValue": 1, "MaxValue": 1} } }, }
Run for online calibration:
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen --mode perf --debug
Configure online calibration data length to match your max-model-len. Use batch_size=1 and ensure data differs to avoid cache hits if prefix caching is enabled.
Performance#
Refer to Using AISBench for performance evaluation for details.
To evaluate the effectiveness of Dynamic Chunked Pipeline Parallel in long sequence LLM inference scenarios, we use DeepSeek-V3.1-W8A8 and Qwen3-235B, deploy P instance in Ascend Atlas A3 inference products*64G (A3), the configuration and performance data are as follows.
Fixed-length requests, concurrency=1:
DeepSeek-V3.1-W8A8:
Configuration
CPP
(Dynamic Chunk,
chunksize=32k)PP
(Static Chunk,
chunksize=32k)Input length 128k
TTFT: 22.5s
TTFT: 27.0s
Qwen3-235B:
Configuration
CPP
(Dynamic Chunk,
chunksize=32k)PP
(Static Chunk,
chunksize=32k)Input length 256k
TTFT: 53.5s
TTFT: 61.4s
Variable-length requests, concurrency=4:
DeepSeek-V3.1-W8A8:
Configuration
4k~64k Input, mean=32k, std=32k
prefix hit rate=99%CPP2TP8
Input throughput: 22424 tps/card
DP2TP8
Input throughput: 16150 tps/card
PCP2TP8
Input throughput: 18197 tps/card
TP16
Input throughput: 18875 tps/card
Constraints#
Pipeline Parallelism Required:
--pipeline-parallel-size > 1Chunked Prefill Required:
--enable-chunked-prefillIncompatible with Balance Scheduling: Cannot enable
VLLM_ASCEND_BALANCE_SCHEDULINGStartup Overhead: Profiling adds ~64 forward passes (tens of seconds)