Dynamic Chunked Pipeline Parallel (CPP)#
TL;DR: CPP profiles the model at startup and sizes each prefill chunk dynamically so that per-chunk latency is equalized, eliminating pipeline bubbles in PP scenarios.
Background#
Problem Statement#
In Pipeline Parallelism (PP) + Chunked Prefill scenarios, long sequences are split into fixed-size chunks that pass through the pipeline sequentially. Because Self-Attention is O(n²) in sequence length and each chunk attends to the entire prefix before it, chunks of the same size take progressively longer to process as the prefix grows:
Chunk 1 (history=0): ██████ → Time T1
Chunk 2 (history=4K): ████████ → Time T2 > T1
Chunk 3 (history=8K): ██████████ → Time T3 > T2
Chunk 4 (history=12K): ████████████ → Time T4 > T3
This time variance propagates across pipeline stages, causing increased idle waiting (Pipeline Bubble) and significantly reducing GPU utilization.
Solution Overview#
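Dynamic Chunked Pipeline Parallel uses a profile-first, then-predict strategy: it measures real latency at startup, then sizes each chunk so that every chunk takes roughly the same time.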
Fixed Chunking (equal chunk size, unequal time):
Stage 0 |■■■■|■■■■■■|■■■■■■■■|■■■■■■■■■■|
Stage 1 | |■■■■ |■■■■■■ |■■■■■■■■ |■■■■■■■■■■|
↑ bubble ↑ bubble ↑ bubble
Dynamic Chunking (unequal chunk size, equal time):
Stage 0 |■■■■■■|■■■■■■|■■■■■■|■■■■■■|
Stage 1 | |■■■■■■|■■■■■■|■■■■■■|■■■■■■|
↑ no bubble — stages stay in sync
The core idea is borrowed from SGLang’s dynamic chunking mechanism, with additional enhancements such as online calibration.
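To make the bubble in the diagrams above concrete, here is a minimal sketch of a two-stage pipeline. It assumes a simplified model in which every stage spends the same time on a given chunk; the function name and the chunk durations are made up for illustration and are not taken from the implementation.

```python
def last_stage_idle_time(durations, num_stages=2):
    """Idle time between consecutive chunks on the last pipeline stage.

    Simplified model (an assumption for illustration): each stage needs
    durations[i] time units for chunk i, and a stage can start chunk i only
    after the previous stage finished chunk i and it finished chunk i-1 itself.
    """
    ready_times = [0.0] * len(durations)  # when each chunk becomes available
    idle = 0.0
    for _ in range(num_stages):
        stage_free_at = 0.0
        idle = 0.0
        finish_times = []
        for i, duration in enumerate(durations):
            start = max(ready_times[i], stage_free_at)
            if i > 0:
                # Gap between finishing chunk i-1 and starting chunk i.
                idle += start - stage_free_at
            stage_free_at = start + duration
            finish_times.append(stage_free_at)
        ready_times = finish_times  # feeds the next stage
    return idle


fixed_chunking = [4, 6, 8, 10]    # equal sizes, growing latency (made-up numbers)
dynamic_chunking = [7, 7, 7, 7]   # unequal sizes, equal latency (made-up numbers)

print("fixed:", last_stage_idle_time(fixed_chunking))
print("dynamic:", last_stage_idle_time(dynamic_chunking))
```

Both schedules perform the same total work (28 time units per stage), but only the growing-latency schedule forces the second stage to wait between chunks.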
Design#
Quadratic Latency Model#
Transformer prefill latency grows quadratically with sequence length due to the O(n²) Self-Attention mechanism:
\[f(l) = a \cdot l^2 + b \cdot l + c\]
Where \(l\) is the sequence length and:
\(a \cdot l^2\): Attention overhead (quadratic)
\(b \cdot l\): Linear operations (FFN, projection)
\(c\): Fixed overhead (kernel launch)
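As a rough illustration of the model, the sketch below evaluates a quadratic \(f(l)\) and uses the difference \(f(H + C) - f(H)\) as the latency of a chunk of size \(C\) after a prefix of length \(H\). The coefficients are invented for the example and do not come from any real profile.

```python
def prefill_latency_ms(length, a, b, c):
    """Quadratic latency model f(l) = a*l^2 + b*l + c."""
    return a * length * length + b * length + c


def chunk_latency_ms(history_len, chunk_size, a, b, c):
    """Incremental cost of one chunk: f(H + C) - f(H)."""
    return (prefill_latency_ms(history_len + chunk_size, a, b, c)
            - prefill_latency_ms(history_len, a, b, c))


# Hypothetical coefficients, chosen only to show the trend.
A, B, C0 = 1e-6, 1e-2, 5.0

latencies = [chunk_latency_ms(h, 4096, A, B, C0) for h in (0, 4096, 8192, 12288)]
for history, latency in zip((0, 4096, 8192, 12288), latencies):
    print(f"history={history:>5}  chunk=4096  latency={latency:.2f} ms")
```

With any positive \(a\), the same 4K chunk gets more expensive as the history grows, which is the effect described in the problem statement.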
Startup Phase: Profiling#
During engine initialization, the system profiles actual model performance:
Sampling: Uniformly sample 64 different chunk sizes from base_chunk_size down to near 0
Execution: Perform real model forward passes for each chunk size and precisely measure latency (milliseconds)
Fitting: Fit the quadratic model using least squares
Target Setting: Calculate target per-chunk latency based on base_chunk_size
In PP mode, all workers execute forward passes to stay synchronized, but only the first PP rank’s timing results are used for scheduling decisions.
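The profiling steps above can be sketched in plain Python as follows. This is not the project's code: `fake_forward_ms` stands in for a timed real forward pass, all function names are hypothetical, and the least-squares fit is done with hand-written normal equations so the example has no dependencies.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def sample_chunk_sizes(base_chunk_size, num_samples=64):
    """Evenly spaced sizes from base_chunk_size down to just above zero."""
    step = base_chunk_size / num_samples
    return [max(1, round(base_chunk_size - i * step)) for i in range(num_samples)]


def fit_quadratic(sizes, latencies_ms):
    """Least-squares fit of latency = a*l^2 + b*l + c.

    Sizes are rescaled to (0, 1] before solving to keep the normal
    equations well conditioned, then the coefficients are scaled back.
    """
    scale = float(max(sizes))
    rows = [[(s / scale) ** 2, s / scale, 1.0] for s in sizes]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, latencies_ms)) for i in range(3)]
    a_scaled, b_scaled, c = solve_linear(xtx, xty)
    return a_scaled / scale ** 2, b_scaled / scale, c


def fake_forward_ms(size):
    # Stand-in for timing a real forward pass (made-up coefficients).
    return 2e-6 * size * size + 0.01 * size + 3.0


base_chunk_size = 8192
sizes = sample_chunk_sizes(base_chunk_size)
a, b, c = fit_quadratic(sizes, [fake_forward_ms(s) for s in sizes])

# Target latency as defined in the runtime phase: f(base_chunk_size) - f(0).
target_latency = a * base_chunk_size ** 2 + b * base_chunk_size
print(f"a={a:.3e} b={b:.3e} c={c:.3f} target={target_latency:.2f} ms")
```

Real measurements are noisy, so a real fit would not recover the coefficients exactly the way this synthetic example does.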
Runtime Phase: Dynamic Prediction#
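Given current prefix length \(L\) and target latency \(T = f(\text{base\_chunk\_size}) - f(0)\), the system solves for the next chunk size \(x\):
\[f(L + x) - f(L) = T\]
Expanding to:
\[a \cdot x^2 + (2aL + b) \cdot x - T = 0\]
Solved using the quadratic formula (taking the positive root):
\[x = \frac{-(2aL + b) + \sqrt{(2aL + b)^2 + 4aT}}{2a}\]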
The result goes through post-processing:
Smoothing: Blend predicted chunk size with base_chunk_size using smooth_factor
Alignment: Round down to multiple of page_size (minimum 64)
Constraints: Not exceeding max_model_len - history_len and max_num_scheduled_tokens
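A minimal sketch of the solve and post-processing steps above, under stated assumptions: the blend is taken to be linear, with `smooth_factor=1.0` meaning "use the prediction as is", and "minimum 64" is read as a floor on the chunk size. Neither detail is confirmed by this document, and the function name is hypothetical.

```python
import math


def predict_chunk_size(history_len, a, b, target_latency, base_chunk_size,
                       smooth_factor, page_size, max_model_len,
                       max_num_scheduled_tokens, min_chunk=64):
    """Pick x such that f(L + x) - f(L) is close to target_latency."""
    if a <= 0:
        raise ValueError("this sketch expects a positive quadratic coefficient")

    # Positive root of a*x^2 + (2*a*L + b)*x - T = 0.
    linear = 2.0 * a * history_len + b
    disc = linear * linear + 4.0 * a * target_latency
    x = (-linear + math.sqrt(disc)) / (2.0 * a)

    # Smoothing (assumed form: linear blend with the base chunk size).
    x = smooth_factor * x + (1.0 - smooth_factor) * base_chunk_size

    # Alignment: round down to a multiple of page_size, with an assumed floor.
    x = max(min_chunk, int(x) // page_size * page_size)

    # Constraints: stay within the model length and the scheduling budget.
    return max(0, min(x, max_model_len - history_len, max_num_scheduled_tokens))


# Hypothetical coefficients and limits, for illustration only.
a, b, base = 2e-6, 0.01, 8192
params = dict(a=a, b=b, target_latency=a * base * base + b * base,
              base_chunk_size=base, smooth_factor=1.0, page_size=128,
              max_model_len=131072, max_num_scheduled_tokens=8192)

for history in (0, 8192, 32768):
    print(history, predict_chunk_size(history, **params))
```

Because the result is rounded down, the predicted chunk finishes at or slightly under the target latency rather than over it, as long as the floor is not hit.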
Online Calibration#
Since profiling only covers sequences up to max_num_batched_tokens (typically shorter than real workloads), the system continuously refines the model at runtime.
Extended Model (two variables):
\[f(C, H) = a \cdot (C + H) \cdot C + b \cdot (C + H) + c\]
Where \(C\) is chunk size and \(H\) is prefix history length.
After each batch, feature vectors [Σ(C+H)·C, Σ(C+H), N] and actual execution time are recorded. Once enough data points accumulate (5-30), model parameters are updated using least squares.
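The calibration loop can be sketched as below. Several details are assumptions made for illustration: \(N\) is taken to be the number of requests in the batch, "5-30" is read as "refit once 5 samples exist, keep at most the 30 most recent", and the class and method names are hypothetical.

```python
from collections import deque


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting; None if near-singular."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


class OnlineCalibrator:
    def __init__(self, min_samples=5, max_samples=30):
        self.min_samples = min_samples
        self.samples = deque(maxlen=max_samples)  # sliding window of batches
        self.coeffs = None                        # (a, b, c) once fitted

    def record(self, requests, elapsed_ms):
        """requests: list of (chunk_size C, history_len H) for one batch."""
        features = [
            float(sum((c + h) * c for c, h in requests)),  # sum of (C+H)*C
            float(sum(c + h for c, h in requests)),        # sum of (C+H)
            float(len(requests)),                          # N
        ]
        self.samples.append((features, elapsed_ms))
        if len(self.samples) >= self.min_samples:
            self._refit()

    def _refit(self):
        rows = [f for f, _ in self.samples]
        times = [t for _, t in self.samples]
        # Scale each column to at most 1 so the normal equations stay stable.
        scales = [max(abs(r[j]) for r in rows) or 1.0 for j in range(3)]
        scaled = [[r[j] / scales[j] for j in range(3)] for r in rows]
        xtx = [[sum(r[i] * r[j] for r in scaled) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * t for r, t in zip(scaled, times)) for i in range(3)]
        weights = solve_linear(xtx, xty)
        if weights is not None:  # keep the old model if the data is degenerate
            self.coeffs = tuple(weights[j] / scales[j] for j in range(3))
```

Keeping the previous coefficients when the system is near-singular is a design choice of this sketch: early batches often look alike, and a refit on nearly collinear samples would produce unstable parameters.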
Architecture#
Key Components#
| Component | Location | Responsibility |
|---|---|---|
| ChunkSizePredictor | | Quadratic model fitting and prediction |
| ProfilingChunkManager | | Manage profiling workflow and predictor |
| Scheduler | | Integrate CPP scheduling |
| EngineCore | | Startup profiling, record execution time |
| NPUWorker | | Execute real forward pass profiling |
| NPUModelRunner | | |
Workflow#
┌─────────────────────────────────────────────────────────────┐
│ Startup Phase │
├─────────────────────────────────────────────────────────────┤
│ 1. EngineCore.init() triggers profiling │
│ 2. ProfilingChunkManager samples 64 chunk sizes │
│ 3. NPUWorker executes forward passes │
│ 4. ChunkSizePredictor fits quadratic model │
│ 5. Target latency = f(base_chunk_size) - f(0) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Runtime Phase │
├─────────────────────────────────────────────────────────────┤
│ For each prefill chunk: │
│ 1. Scheduler queries ChunkSizePredictor │
│ 2. Given history length L, solve for optimal chunk size │
│ 3. Apply smoothing and alignment │
│ 4. Execute chunk │
│ 5. Record actual timing for online calibration │
│ 6. Update model if enough samples collected │
└─────────────────────────────────────────────────────────────┘
Comparison with SGLang#
| Feature | SGLang Dynamic Chunking | Dynamic Chunked Pipeline Parallel |
|---|---|---|
| Profiling method | Preset quadratic function | Real forward pass profiling at startup |
| Model fitting | \(f(l) = a \cdot l^2 + b \cdot l + c\) | Same + online calibration \(f(C,H)\) |
| Online updates | None | History-based fitting |
| Accuracy | May deviate on different hardware | Adapts to actual hardware performance |
| Startup cost | None | ~64 forward passes (tens of seconds) |
Constraints#
Pipeline Parallelism Required: Must set --pipeline-parallel-size > 1
Chunked Prefill Required: Must enable --enable-chunked-prefill
Incompatible with Balance Scheduling: Cannot enable VLLM_ASCEND_BALANCE_SCHEDULING
Startup Overhead: Profiling phase adds tens of seconds to initialization
Memory: No additional runtime memory overhead; profiling reuses existing dummy_run mechanism
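The first three constraints above could be checked up front with something like the sketch below. The function and its parameters are hypothetical, and treating the value "1" as "enabled" for VLLM_ASCEND_BALANCE_SCHEDULING is an assumption; consult the actual configuration code for the real semantics.

```python
import os


def check_cpp_constraints(pipeline_parallel_size, enable_chunked_prefill, env=None):
    """Return a list of human-readable constraint violations (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    if pipeline_parallel_size <= 1:
        problems.append("--pipeline-parallel-size must be greater than 1")
    if not enable_chunked_prefill:
        problems.append("--enable-chunked-prefill must be enabled")
    # Assumption: the variable is considered enabled when set to "1".
    if env.get("VLLM_ASCEND_BALANCE_SCHEDULING", "0") == "1":
        problems.append("VLLM_ASCEND_BALANCE_SCHEDULING must not be enabled")
    return problems


print(check_cpp_constraints(2, True, env={}))
print(check_cpp_constraints(1, False, env={"VLLM_ASCEND_BALANCE_SCHEDULING": "1"}))
```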