Row-Parallel Chunking

Row-Parallel Chunking is an optimization method for tensor-parallel inference on Intel® Gaudi® that overlaps computation with communication in RowParallelLinear layers. In standard tensor-parallel inference, each RowParallelLinear layer performs a matrix multiplication followed by a blocking all-reduce across tensor-parallel ranks.

In a standard RowParallelLinear forward pass, the execution is sequential:

Compute output = matmul(input, weight).
Perform a blocking all-reduce of output across tensor-parallel ranks.

With chunking enabled, the forward pass becomes:

Split the input into N chunks along the token dimension.
For each chunk i:

Compute output_i = matmul(input_i, weight).
Launch all_reduce(output_i) asynchronously.

Wait for all async all-reduce operations to complete.
Concatenate the chunk outputs.

This pipelining allows the all-reduce of chunk i to run concurrently with the matmul of chunk i+1, reducing idle time on both the compute engine and the network interface.

Chunking Conditions¶

Chunking is only applied when all of the following conditions are met:

Chunking is configured: num_chunks > 1
All-reduce is required: reduce_results is enabled on the layer
Tensor parallelism is active: tp_size > 1
Input is sufficiently large: total_tokens >= chunk_threshold

When any condition is not met, the layer falls back to the standard single-shot computation.

Input Handling¶

The implementation handles both 2D [tokens, hidden] and 3D [batch, seq_len, hidden] inputs:

3D with seq_len > 1 (prefill): Chunks along the sequence dimension
3D with seq_len == 1 (decode): Chunks along the batch dimension
2D: Chunks along the first token dimension

Recommended Usage¶

Chunking effectiveness depends on balancing communication overlap benefits against computational overhead.

We recommend this feature for:

Running with tensor parallelism (TP > 1)
Serving workloads with large batch sizes or long prefill sequences
Workloads where all-reduce communication is a significant fraction of the prefill time

Configuration¶

The feature is controlled by two environment variables:

Environment Variable	Config Name	Default	Description
`VLLM_ROW_PARALLEL_CHUNKS`	`row_parallel_chunks`	`1` (disabled)	The number of chunks to split the input into. Setting the variable to a value greater than 1 enables chunking.
`VLLM_ROW_PARALLEL_CHUNK_THRESHOLD`	`row_parallel_chunk_threshold`	`8192`	The minimum number of tokens required to activate chunking. Inputs below this threshold use the standard path.

The following example shows how to set these variables to enable chunking with different configurations:

# Enable chunking with 8 chunks (the default threshold of 8192 tokens)
export VLLM_ROW_PARALLEL_CHUNKS=8

# Enable chunking with 16 chunks and a lower threshold
export VLLM_ROW_PARALLEL_CHUNKS=16
export VLLM_ROW_PARALLEL_CHUNK_THRESHOLD=4096

Performance Characteristics¶

The speedup ratio is the baseline time divided by the chunked time, where values higher than 1.0 indicate speedup. The following tables show the speedup ratio for an isolated RowParallelLinear layer measured across different tensor parallelism sizes, chunk counts, and token counts. To interpret the values, consider the following:

Values greater than 1.0 indicate a speedup over the non-chunked baseline, for example, 1.5 means 50% faster.
Values below 1.0 indicate a slowdown due to chunking overhead exceeding the overlap benefit.

Tensor parallelism size equal to 2:

Tokens	2 chunks	4 chunks	8 chunks	16 chunks	32 chunks	64 chunks
1024	1.089	0.811	0.574	0.308	0.207	0.108
2048	1.076	1.161	0.753	0.477	0.334	0.176
4096	1.322	1.393	1.431	0.810	0.656	0.351
8192	1.158	1.482	1.453	1.204	1.067	0.620
16384	1.239	1.316	1.589	1.489	1.514	1.075
32768	1.246	1.460	1.434	1.649	1.563	1.569
65536	1.246	1.424	1.580	1.483	1.555	1.548
131072	1.268	1.442	1.503	1.676	1.533	1.514

Tensor parallelism size equal to 4:

Tokens	2 chunks	4 chunks	8 chunks	16 chunks	32 chunks	64 chunks
1024	0.892	0.579	0.374	0.195	0.104	0.060
2048	1.035	0.888	0.509	0.307	0.142	0.088
4096	1.156	1.081	0.795	0.466	0.245	0.134
8192	1.171	1.304	1.255	0.749	0.485	0.244
16384	1.162	1.309	1.416	1.216	0.780	0.496
32768	1.118	1.237	1.280	1.427	1.201	0.766
65536	1.218	1.310	1.386	1.528	1.553	1.195
131072	1.193	1.387	1.332	1.353	1.560	1.545

Tensor parallelism size equal to 8:

Tokens	2 chunks	8 chunks	16 chunks	32 chunks	64 chunks
1024	0.656	0.253	0.132	0.075	0.045
2048	0.828	0.324	0.183	0.087	0.051
4096	0.919	0.495	0.264	0.146	0.078
8192	0.993	0.705	0.470	0.240	0.157
16384	0.990	1.024	0.684	0.402	0.249
32768	0.972	1.118	0.942	0.760	0.469
65536	0.989	1.164	1.129	1.179	0.758
131072	1.018	1.241	1.277	1.297	1.090

Performance Insights¶

Optimal configuration varies with tensor parallelism size and sequence length. There is no single setting that will benefit all benchmarks, so we recommend experimenting to find which configuration works best for your specific tensor parallelism size and sequence length. A good starting point is 2 chunks with a 4096-token threshold.

Diminishing returns with too many chunks. Excessively fine chunking introduces overhead from graph breaks, kernel launch latency, and reduced per-chunk compute efficiency. The optimal chunk count depends on both tensor parallelism size and typical token count.

Recommended Settings¶

The following recommendations are based on isolated layer benchmarks using the Meta Llama 3.3 70B model. In end-to-end inference, the optimal configuration may differ depending on the model architecture, sequence lengths, and workload mix. We recommend benchmarking with your specific setup.

Tensor parallelism size	Recommended chunks	Notes
1	1 (disabled)	No all-reduce needed; chunking adds overhead only
2	2-64	Beneficial for token counts ≥ 4096
4	2-16	Beneficial for token counts ≥ 8192
8	2-8	Beneficial for token counts ≥ 16384

Implementation Details¶

This feature is implemented in vllm_gaudi/ops/hpu_row_parallel_linear.py as HPURowParallelLinear, which registers as an out-of-tree (OOT) override for vLLM's RowParallelLinear. The chunking logic is entirely self-contained in the forward method and does not modify any other part of the model or the inference pipeline.

Each chunk boundary introduces a torch._dynamo.graph_break() to ensure correct async all-reduce semantics under torch.compile. This means the compiled graph will be split at chunk boundaries, which is a necessary trade-off for enabling async communication.