Dynamic Chunked Pipeline Parallel

Dynamic Chunked Pipeline Parallel#

Note

For design details and mathematical models, see Design Document. For deployment tutorial, see Dynamic Chunked Pipeline Parallel Tutorial.

Overview#

Dynamic Chunked Pipeline Parallel (CPP) is a profiling-based dynamic chunking strategy that optimizes prefill performance for long sequences in Pipeline Parallelism (PP) scenarios.

When to Use#

Variable-length sequence serving: PP does not introduce degradation on short sequences, and gains benefits through dynamic chunks on long sequences.
Ultra-long sequence inference: For sequences exceeding single-machine memory capacity (e.g., 1M tokens), dynamic chunking significantly reduces pipeline idle time.

Supported Scenarios#

Currently CPP mainly focuses on optimization during the prefill phase. It is better to be used in PD disaggregation scenarios. Supported features are as follows:

	Eager	Graph	Prefix Cache	Chunked Prefill
CPP	✅	✅	✅	✅

How to Enable#

Online Serving#

vllm serve <model_path> \
    --pipeline-parallel-size 2 \
    --enable-chunked-prefill \
    --additional-config '{"profiling_chunk_config": {"enabled": true}}'

Offline Inference#

from vllm import LLM

llm = LLM(
    model="<model_path>",
    pipeline_parallel_size=2,
    additional_config={"profiling_chunk_config": {"enabled": True}},
)

Configuration Parameters#

Parameter	Type	Default	Description
`enabled`	bool	False	Enable/disable Dynamic Chunked Pipeline Parallel
`smooth_factor`	float	1.0	Smoothing factor (0 < x ≤ 1.0). Higher values trust dynamic prediction more
`min_chunk`	int	4096	Minimum chunk size for dynamic calculation
`need_timing`	bool	True	Enable/disable Online Calibration

Parameter Tuning#

smooth_factor: Controls trust level in dynamic prediction
- 1.0: Strictly follow model prediction
- 0.6~0.85: Balance dynamic adjustment and scheduling overhead
- 0.0: No dynamic adjustment (degrades to fixed chunking)
min_chunk: Generally doesn’t need adjustment. Should be smaller than max-num-batched-tokens

Recommended Settings#

max-num-batched-tokens#

Notably, the TTFT of CPP is very sensitive to max-num-batched-tokens (considered the initial chunksize for dynamic solving). Because if it is too large, it will introduce significant computational voids, and if it is too small, it will lead to a decrease in operator efficiency. To leave enough room for dynamic adjustments, we recommend that the longer the sequence being processed, the larger the max-num-batched-tokens should be set. Recommended values:

Sequence Length	`max-num-batched-tokens`
64k	20480
128k	32768

Online Calibration#

For optimal performance, online calibrate with real data before production:

You can use aisbench to generate fixed-length random datasets. Refer to Using AISBench for performance evaluation for details.

Modify <YOUR_AISBENCH_PATH>/benchmark/ais_bench/datasets/synthetic/synthetic_config.py:

synthetic_config = {
    "Type": "string",
    "RequestCount": 5,
    "TrustRemoteCode": False,
    "StringConfig": {
        "Input": {
            "Method": "uniform",
            "Params": {"MinValue": 131072, "MaxValue": 131072}  # Your max sequence length, max-model-len
        },
        "Output": {
            "Method": "uniform",
            "Params": {"MinValue": 1, "MaxValue": 1}
        }
    },
}

Run for online calibration:

ais_bench --models vllm_api_stream_chat --datasets synthetic_gen --mode perf --debug

Configure online calibration data length to match your max-model-len. Use batch_size=1 and ensure data differs to avoid cache hits if prefix caching is enabled.

Performance#

Refer to Using AISBench for performance evaluation for details.

To evaluate the effectiveness of Dynamic Chunked Pipeline Parallel in long sequence LLM inference scenarios, we use DeepSeek-V3.1-W8A8 and Qwen3-235B, deploy P instance in Ascend Atlas A3 inference products*64G (A3), the configuration and performance data are as follows.

Fixed-length requests, concurrency=1:

DeepSeek-V3.1-W8A8:

Configuration

CPP
(Dynamic Chunk,
chunksize=32k)

PP
(Static Chunk,
chunksize=32k)

Input length 128k

TTFT: 22.5s

TTFT: 27.0s
Qwen3-235B:

Configuration

CPP
(Dynamic Chunk,
chunksize=32k)

PP
(Static Chunk,
chunksize=32k)

Input length 256k

TTFT: 53.5s

TTFT: 61.4s

Variable-length requests, concurrency=4:

DeepSeek-V3.1-W8A8:

Configuration	4k~64k Input, mean=32k, std=32k prefix hit rate=99%
CPP2TP8	Input throughput: 22424 tps/card
DP2TP8	Input throughput: 16150 tps/card
PCP2TP8	Input throughput: 18197 tps/card
TP16	Input throughput: 18875 tps/card

Constraints#

Pipeline Parallelism Required: --pipeline-parallel-size > 1
Chunked Prefill Required: --enable-chunked-prefill
Incompatible with Balance Scheduling: Cannot enable VLLM_ASCEND_BALANCE_SCHEDULING
Startup Overhead: Profiling adds ~64 forward passes (tens of seconds)