Dynamic Chunked Pipeline Parallel#

Note

For design details and mathematical models, see the Design Document. For a deployment tutorial, see the Dynamic Chunked Pipeline Parallel Tutorial.

Overview#

Dynamic Chunked Pipeline Parallel (CPP) is a profiling-based dynamic chunking strategy that optimizes prefill performance for long sequences in Pipeline Parallelism (PP) scenarios.

When to Use#

  • Variable-length sequence serving: PP introduces no degradation on short sequences, while dynamic chunking improves performance on long sequences.

  • Ultra-long sequence inference: For sequences exceeding single-machine memory capacity (e.g., 1M tokens), dynamic chunking significantly reduces pipeline idle time.
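To illustrate why chunk sizes that shrink as the prefix grows can reduce pipeline idle time, here is a minimal sketch. It assumes a toy cost model (a linear term plus an attention term that grows with the prefix) and hypothetical names `chunk_time` and `dynamic_chunks`; the real CPP implementation derives its model from profiling, so treat this only as an intuition aid.

```python
def chunk_time(prefix: int, chunk: int, a: float = 1.0, b: float = 1e-4) -> float:
    """Toy cost of prefilling `chunk` tokens after `prefix` tokens.

    Assumption: a linear per-token term plus an attention term that
    grows with the prefix length. Coefficients are made up.
    """
    return a * chunk + b * chunk * (prefix + chunk / 2)


def dynamic_chunks(total: int, first_chunk: int, min_chunk: int = 4096) -> list[int]:
    """Split `total` tokens so each chunk costs about as much as the first."""
    target = chunk_time(0, first_chunk)
    chunks, prefix = [], 0
    while prefix < total:
        remaining = total - prefix
        # Binary search for the largest chunk whose cost stays within target.
        lo, hi = 1, remaining
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if chunk_time(prefix, mid) <= target:
                lo = mid
            else:
                hi = mid - 1
        # Never go below min_chunk, never exceed what is left.
        size = min(max(lo, min_chunk), remaining)
        chunks.append(size)
        prefix += size
    return chunks


if __name__ == "__main__":
    sizes = dynamic_chunks(total=131072, first_chunk=32768)
    print(sizes)
    # With fixed 32k chunks, the toy cost of the last chunk is much higher
    # than the first, which is what leaves pipeline stages waiting.
    print(chunk_time(98304, 32768) / chunk_time(0, 32768))
```

Under this toy model, fixed-size chunks take progressively longer as the prefix grows, so pipeline stages wait on each other; equal-cost chunks keep the stages closer to lockstep.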

Supported Scenarios#

CPP currently focuses on optimizing the prefill phase, so it is best suited to PD (prefill-decode) disaggregation scenarios. Supported features are as follows:

|     | Eager | Graph | Prefix Cache | Chunked Prefill |
|-----|-------|-------|--------------|-----------------|
| CPP |       |       |              |                 |

How to Enable#

Online Serving#

vllm serve <model_path> \
    --pipeline-parallel-size 2 \
    --enable-chunked-prefill \
    --additional-config '{"profiling_chunk_config": {"enabled": true}}'

Offline Inference#

from vllm import LLM

llm = LLM(
    model="<model_path>",
    pipeline_parallel_size=2,
    enable_chunked_prefill=True,  # CPP requires chunked prefill
    additional_config={"profiling_chunk_config": {"enabled": True}},
)

Configuration Parameters#

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| enabled | bool | False | Enable/disable Dynamic Chunked Pipeline Parallel |
| smooth_factor | float | 1.0 | Smoothing factor (0 < x ≤ 1.0). Higher values trust dynamic prediction more |
| min_chunk | int | 4096 | Minimum chunk size for dynamic calculation |
| need_timing | bool | True | Enable/disable Online Calibration |
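As a sketch of how the four parameters might be combined into one `--additional-config` value, the helper below builds and quotes the JSON. The function name `build_profiling_chunk_config` is hypothetical, and placing all four keys side by side under `profiling_chunk_config` is an assumption based on the `enabled` example above.

```python
import json
import shlex


def build_profiling_chunk_config(
    enabled: bool = True,
    smooth_factor: float = 1.0,
    min_chunk: int = 4096,
    need_timing: bool = True,
) -> dict:
    """Hypothetical helper: assemble the additional_config dictionary."""
    # Range taken from the parameter description: 0 < x <= 1.0.
    if not 0 < smooth_factor <= 1.0:
        raise ValueError("smooth_factor must satisfy 0 < x <= 1.0")
    if min_chunk <= 0:
        raise ValueError("min_chunk must be a positive number of tokens")
    return {
        "profiling_chunk_config": {
            "enabled": enabled,
            "smooth_factor": smooth_factor,
            "min_chunk": min_chunk,
            "need_timing": need_timing,
        }
    }


config = build_profiling_chunk_config(smooth_factor=0.8)

# Quote the JSON so it survives the shell as a single argument.
cli_arg = "--additional-config " + shlex.quote(json.dumps(config))
print(cli_arg)
```

The same dictionary can be passed directly as `additional_config=config` for offline inference.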

Parameter Tuning#

  • smooth_factor: Controls trust level in dynamic prediction

    • 1.0: Strictly follow model prediction

    • 0.6~0.85: Balance dynamic adjustment and scheduling overhead

    • 0.0: No dynamic adjustment (degrades to fixed chunking). This is the limiting case; the accepted range is 0 < x ≤ 1.0

  • min_chunk: Usually needs no adjustment. It should be smaller than --max-num-batched-tokens
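One way to picture the effect of smooth_factor is as a linear blend between a fixed base chunk size and the predicted chunk size. The sketch below assumes that blend and uses a hypothetical function name `smoothed_chunk`; the actual formula used by CPP may differ, so consult the Design Document for the real model.

```python
def smoothed_chunk(
    predicted: int,
    base_chunk: int,
    smooth_factor: float,
    min_chunk: int = 4096,
) -> int:
    """Blend the predicted chunk size with the fixed base size.

    Assumption: linear interpolation. smooth_factor=1.0 follows the
    prediction exactly; values near 0 stay close to the fixed base size.
    """
    blended = base_chunk + smooth_factor * (predicted - base_chunk)
    # The result never drops below the configured minimum chunk size.
    return max(min_chunk, int(round(blended)))


# Prediction says 16k tokens, fixed chunking would use 32k tokens.
for factor in (1.0, 0.75, 0.5):
    print(factor, smoothed_chunk(16384, 32768, factor))
```

Lower factors damp the swing between consecutive chunk sizes, which is the trade-off against scheduling overhead that the 0.6~0.85 range refers to.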

Performance#

Refer to Using AISBench for performance evaluation for details.

To evaluate the effectiveness of Dynamic Chunked Pipeline Parallel for long-sequence LLM inference, we deployed the P (prefill) instance of DeepSeek-V3.1-W8A8 and Qwen3-235B on Ascend Atlas A3 inference products (64G, referred to as A3). The configurations and performance data are as follows.

Fixed-length requests, concurrency=1:

  • DeepSeek-V3.1-W8A8:

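    | Configuration | CPP (Dynamic Chunk, chunksize=32k) | PP (Static Chunk, chunksize=32k) |
    |---------------|------------------------------------|----------------------------------|
    | Input length 128k | TTFT: 22.5s | TTFT: 27.0s |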

  • Qwen3-235B:

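    | Configuration | CPP (Dynamic Chunk, chunksize=32k) | PP (Static Chunk, chunksize=32k) |
    |---------------|------------------------------------|----------------------------------|
    | Input length 256k | TTFT: 53.5s | TTFT: 61.4s |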
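The fixed-length TTFT figures above can be turned into relative improvements with a few lines of arithmetic. The snippet only reuses the reported numbers; the dictionary layout and function name are illustrative.

```python
# TTFT in seconds, copied from the fixed-length results above.
ttft_seconds = {
    "DeepSeek-V3.1-W8A8 (128k input)": {"cpp": 22.5, "pp": 27.0},
    "Qwen3-235B (256k input)": {"cpp": 53.5, "pp": 61.4},
}


def ttft_summary(cpp: float, pp: float) -> tuple[float, float]:
    """Return (fractional TTFT reduction, speedup) of CPP over static PP."""
    return (pp - cpp) / pp, pp / cpp


for name, t in ttft_seconds.items():
    reduction, speedup = ttft_summary(t["cpp"], t["pp"])
    print(f"{name}: TTFT reduced by {reduction:.1%} ({speedup:.2f}x speedup)")
```

This works out to a TTFT reduction of roughly 17% for DeepSeek-V3.1-W8A8 and roughly 13% for Qwen3-235B.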

Variable-length requests, concurrency=4:

  • DeepSeek-V3.1-W8A8:

    | Configuration | 4k~64k Input, mean=32k, std=32k, prefix hit rate=99% |
    |---------------|------------------------------------------------------|
    | CPP2TP8 | Input throughput: 22424 tps/card |
    | DP2TP8 | Input throughput: 16150 tps/card |
    | PCP2TP8 | Input throughput: 18197 tps/card |
    | TP16 | Input throughput: 18875 tps/card |
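For the variable-length results, a short calculation shows how far CPP2TP8 is ahead of each alternative. Only the reported throughput values are used; the variable names are illustrative.

```python
# Input throughput in tokens/s per card, copied from the results above.
throughput = {
    "CPP2TP8": 22424,
    "DP2TP8": 16150,
    "PCP2TP8": 18197,
    "TP16": 18875,
}

cpp = throughput["CPP2TP8"]

# Fractional gain of CPP2TP8 over every other configuration.
gains = {
    name: cpp / value - 1.0
    for name, value in throughput.items()
    if name != "CPP2TP8"
}

for name, gain in gains.items():
    print(f"CPP2TP8 vs {name}: +{gain:.1%} input throughput")
```

On these figures CPP2TP8 delivers about 39% more input throughput than DP2TP8, about 23% more than PCP2TP8, and about 19% more than TP16.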

Constraints#

  • Pipeline Parallelism Required: --pipeline-parallel-size > 1

  • Chunked Prefill Required: --enable-chunked-prefill

  • Incompatible with Balance Scheduling: Cannot enable VLLM_ASCEND_BALANCE_SCHEDULING

  • Startup Overhead: Profiling adds ~64 forward passes (tens of seconds)
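The first three constraints can be checked before launch with a small preflight function. This is a sketch only: `check_cpp_constraints` is a hypothetical helper, not part of vLLM, and treating VLLM_ASCEND_BALANCE_SCHEDULING as enabled when it is set to "1" or "true" is an assumption.

```python
import os
from typing import Mapping, Optional


def check_cpp_constraints(
    pipeline_parallel_size: int,
    enable_chunked_prefill: bool,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return a list of violated constraints (empty when all are met)."""
    env = os.environ if env is None else env
    problems = []
    if pipeline_parallel_size <= 1:
        problems.append("--pipeline-parallel-size must be greater than 1")
    if not enable_chunked_prefill:
        problems.append("--enable-chunked-prefill is required")
    # Assumption: "1" or "true" means balance scheduling is switched on.
    flag = env.get("VLLM_ASCEND_BALANCE_SCHEDULING", "0").strip().lower()
    if flag in {"1", "true"}:
        problems.append("VLLM_ASCEND_BALANCE_SCHEDULING must not be enabled")
    return problems


for problem in check_cpp_constraints(pipeline_parallel_size=2, enable_chunked_prefill=True):
    print("Constraint violated:", problem)
```

Passing `env` explicitly makes the check easy to test without touching the real process environment.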