Sequence Parallelism

What Is Sequence Parallelism

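Sequence parallelism (SP) was first introduced in Megatron to reduce activation memory during training. Its core change is replacing Allreduce->LayerNorm with ReduceScatter->LayerNorm->Allgather. vllm later adopted the technique for inference. Note that splitting Allreduce into ReduceScatter and Allgather does not by itself improve performance; it reduces the LayerNorm computation on each rank, but that gain is small. The real benefits of SP are: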

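LLM inference deployments often use quantization. Take INT8 quantization, which is common on NPUs, as an example: after LayerNorm, a Quant operator quantizes the hidden states from BF16 to INT8. The Allgather that follows then moves half as much data, and its latency is nearly halved as well.
ReduceScatter and Allgather can each be fused with the adjacent Matmul (the preceding one and the following one, respectively) into a communication-computation overlapped operator, reducing latency.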
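The rewrite from Allreduce->LayerNorm to ReduceScatter->LayerNorm->Allgather is valid because the normalization works on each token independently. The sketch below simulates two TP ranks in a single process with plain Python lists. The helper names are illustrative, a weightless RMSNorm stands in for the normalization layer, and no real collective library is involved.

```python
import math

def rms_norm(rows, eps=1e-6):
    """Normalize each token (row) independently; no learned weight."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
        out.append([x / rms for x in row])
    return out

def all_reduce(partials):
    """Element-wise sum over ranks; every rank would receive the full result."""
    num_tokens, hidden = len(partials[0]), len(partials[0][0])
    return [[sum(p[t][h] for p in partials) for h in range(hidden)]
            for t in range(num_tokens)]

def reduce_scatter(partials):
    """Same sum, but rank r keeps only its contiguous slice of the tokens."""
    tp = len(partials)
    full = all_reduce(partials)
    size = len(full) // tp  # assumes num_tokens is divisible by tp
    return [full[r * size:(r + 1) * size] for r in range(tp)]

def all_gather(shards):
    """Concatenate every rank's shard along the token dimension."""
    return [row for shard in shards for row in shard]

# Two TP ranks, 4 tokens, hidden size 2: each rank holds a partial Matmul output.
partials = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    [[0.5, 0.5], [1.0, 1.0], [1.5, 1.5], [2.0, 2.0]],
]

# Baseline: Allreduce -> norm over all 4 tokens on every rank.
baseline = rms_norm(all_reduce(partials))

# SP: ReduceScatter -> norm over 2 tokens per rank -> Allgather.
sp_shards = [rms_norm(shard) for shard in reduce_scatter(partials)]
sp_result = all_gather(sp_shards)
```

Both paths give the same output, while each rank normalizes only num_tokens / tp rows in the SP path. That is the small LayerNorm saving mentioned above.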

How to Use

Currently, vllm-ascend implements sequence parallelism for VL models on top of an Inductor pass. Enable it as follows:

vllm serve Qwen/Qwen3-VL-2B-Instruct \
    --tensor-parallel-size 2 \
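    --compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 1000}}'

"enable_sp": the switch for SP. Because SP relies on graph mode, it is not supported in eager mode.
sp_min_token_num (from the upstream vllm pass_config): in our experiments, SP can actually hurt performance when the number of tokens is small (empirically, fewer than 1000), because the fixed overhead of the communication operators dominates when the communication volume is small. SP takes effect only when num_tokens >= sp_min_token_num. The default on Ascend is 1000 and generally does not need to be changed. To customize it, use --compilation-config '{"pass_config": {"enable_sp": true, "sp_min_token_num": 512}}'. The value is appended to compile_ranges_split_points, which splits the graph compilation ranges so that pass applicability is checked range by range.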
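As a small illustration of the two options, the sketch below builds the JSON string passed to --compilation-config and applies the num_tokens >= sp_min_token_num rule. The helper names build_compilation_config and sp_applies are hypothetical and not part of vllm or vllm-ascend. Only the keys pass_config, enable_sp, and sp_min_token_num come from the text above.

```python
import json

def build_compilation_config(sp_min_token_num=None):
    """Return a JSON string for --compilation-config with SP enabled.

    Leaving sp_min_token_num unset keeps the platform default.
    """
    pass_config = {"enable_sp": True}
    if sp_min_token_num is not None:
        pass_config["sp_min_token_num"] = sp_min_token_num
    return json.dumps({"pass_config": pass_config})

def sp_applies(num_tokens, sp_min_token_num=1000):
    """SP takes effect only when num_tokens >= sp_min_token_num."""
    return num_tokens >= sp_min_token_num

default_cfg = build_compilation_config()
custom_cfg = build_compilation_config(sp_min_token_num=512)
```

With the default threshold of 1000, a 256-token batch would skip SP and a 4096-token batch would use it. Lowering the threshold to 512 moves that boundary accordingly.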

If you do not need to change sp_min_token_num, the simplest and recommended way to enable SP is:

vllm serve Qwen/Qwen3-VL-2B-Instruct \
    --tensor-parallel-size 2 \
    --compilation-config '{"pass_config": {"enable_sp": true}}'

Difference Between SP and Flash Comm V1

Flash Comm V1 (FC1) is an enhanced version of sequence parallelism developed for NPUs. The enhancements include:

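For models with the MLA structure, the Allgather is postponed until after the QKV projection, further reducing communication volume.
For MoE models, the Allgather is postponed until after Gating+DynamicQuant, likewise to reduce communication volume.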

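FC1 is an optimization unique to vllm-ascend. It is currently implemented with Custom OPs, which makes VL models hard to support (see [RFC]: support sequence parallelism by pass for the reasons). For now, therefore, FC1 and SP are complementary.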

Support Matrix

Without Quantization

	VL + Dense	VL + MoE	non-VL + Dense	non-VL + MoE
Sequence Parallelism	graph	graph	x	x
Flash Comm V1	x	x	eager/graph	eager/graph

With Quantization

SP does not currently support quantization; adaptation is in progress.

	VL + Dense	VL + MoE	non-VL + Dense	non-VL + MoE
Sequence Parallelism	x	x	x	x
Flash Comm V1	x	x	eager/graph	eager/graph

Pass Design

When SP is enabled, the following passes run in order: SequenceParallelismPass, then SequenceParallelismMoePass.

SequenceParallelismPass

NoOpEliminationPass runs first to eliminate redundant view-like operations; the AllReduce-based patterns are then applied:

Pattern	Match	Replacement
`MiddleAllReduceRMSNormPattern`	`all_reduce` + `layernorm`	`reduce_scatter` + `layernorm` + `all_gather`
`LastAllReduceRMSNormPattern`	Same (last layer, no residual)	Same
`Qwen3VLMiddleAllReduceRMSNormPattern`	`all_reduce` + add + `layernorm`	`reduce_scatter` + chunk(`deepstack_input_embeds`) + add + `layernorm` + `all_gather`

Why Qwen3 VL needs special handling via Qwen3VLMiddleAllReduceRMSNormPattern

In Qwen3-VL, the middle layers insert an extra add between all_reduce and layernorm: hidden_states = hidden_states + deepstack_input_embeds. Under SP, hidden_states (that is, input) has shape [seq_len/tp, hidden] on each rank after the reduce-scatter, whereas deepstack_input_embeds comes from the vision/deepstack path and keeps the full-sequence shape [seq_len, hidden] (typically replicated across TP ranks). Computing reduce_scatter(input) + deepstack_input_embeds directly would therefore fail with a shape mismatch. The fix is to chunk deepstack_input_embeds by tp_size so that each rank computes add(reduce_scatter, chunk(deepstack_input_embeds)[tp_rank]), which keeps the shapes consistent before layernorm and all_gather.
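The shape argument can be checked with a small single-process simulation. This is a sketch under stated assumptions: plain Python lists stand in for tensors, chunk and add are illustrative helpers rather than vllm operators, and seq_len is assumed to be divisible by tp_size.

```python
def chunk(rows, tp_size):
    """Split the token dimension evenly into tp_size contiguous shards."""
    size = len(rows) // tp_size
    return [rows[r * size:(r + 1) * size] for r in range(tp_size)]

def add(a, b):
    """Element-wise add of two [tokens, hidden] lists; token counts must match."""
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} tokens vs {len(b)} tokens")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

tp_size = 2
# Already-summed hidden states, shape [seq_len=4, hidden=2].
reduced = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
# Full-sequence embeddings, replicated on every rank.
deepstack_input_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# After reduce_scatter, each rank holds [seq_len / tp, hidden].
scattered = chunk(reduced, tp_size)

# Naive add: a 2-token shard plus the 4-token full tensor does not line up.
try:
    add(scattered[0], deepstack_input_embeds)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Fix: each rank adds only its own chunk of deepstack_input_embeds.
deepstack_chunks = chunk(deepstack_input_embeds, tp_size)
per_rank = [add(scattered[r], deepstack_chunks[r]) for r in range(tp_size)]

# Gathering the per-rank results reproduces the non-SP full-sequence add.
gathered = [row for shard in per_rank for row in shard]
reference = add(reduced, deepstack_input_embeds)
```

Here gathered equals reference, so chunking deepstack_input_embeds by rank leaves the layer's result unchanged while keeping every intermediate at [seq_len/tp, hidden].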

SequenceParallelismMoePass

After SequenceParallelismPass is applied, the computation graph of an MoE model looks like this:

[Figure: AllGather EP computation graph]

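Overview

Postponing allgather: under SP, the residual is chunked by the tensor-parallel size. This causes a shape mismatch between the hidden states and the residual in the next layer's layernorm: the hidden states are gathered (full sequence), while the residual is still chunked. The fix is to move all_gather after layernorm, so that layernorm operates on consistent shapes on every rank. MiddleLayerAllgatherAddRMSNormPattern, LastLayerAllgatherRMSNormPattern, and Qwen3VLMiddleLayerAllgatherAddRMSNormPattern are designed for this purpose and handle the different layer positions and structural variants (see the table below).
AllGatherChunkNoOp cleanup: when MoE SP is enabled, vllm introduces a sequence_parallel_chunk operation (sp_chunk in the figure). Together with the preceding all_gather, the pair forms a redundant no-op (all_gather gathers, then chunk splits again). AllGatherChunkNoOpPattern replaces the pair with an identity operation, eliminating the redundant communication and computation.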

Pattern details:

Pattern	Match	Replacement
`MiddleLayerAllgatherAddRMSNormPattern`	`all_gather` + slice + `layernorm`	`layernorm` + `all_gather`
`LastLayerAllgatherRMSNormPattern`	Same (last layer, no residual)	Same
`Qwen3VLMiddleLayerAllgatherAddRMSNormPattern`	`all_gather` + slice + add + `layernorm`	add(chunk) + `layernorm` + `all_gather`
`AllGatherChunkNoOpPattern`	`all_gather` + `sequence_parallel_chunk_impl`	Identity (no-op)
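To see why the all_gather + chunk pair handled by AllGatherChunkNoOpPattern is redundant, the sketch below runs the round trip on plain Python lists. Both helpers are simplified stand-ins, not vllm's actual operators, and the sketch assumes the token count is divisible by the TP size with no padding involved.

```python
def all_gather(shards):
    """Concatenate every rank's shard along the token dimension."""
    return [row for shard in shards for row in shard]

def sequence_parallel_chunk(rows, tp_size, tp_rank):
    """Stand-in for the chunk op: keep only this rank's slice of the tokens."""
    size = len(rows) // tp_size
    return rows[tp_rank * size:(tp_rank + 1) * size]

# Per-rank shards of shape [seq_len / tp, hidden] for tp = 2.
shards = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
tp_size = len(shards)

# Gather the full sequence, then immediately split it again on every rank.
round_trip = [
    sequence_parallel_chunk(all_gather(shards), tp_size, rank)
    for rank in range(tp_size)
]
```

Each rank ends up with exactly the shard it started with, so replacing the pair with an identity saves one collective without changing the result.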

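FAQ

Q1: Is SP enabled by default?

No, SP is not enabled by default. It is currently experimental and will be enabled by default in the future.

enable_sp is handled in the code as follows: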

In pass_config, enable_sp and sp_min_token_num default to None.
NPUPlatform.apply_config_platform_defaults: if enable_sp is True and sp_min_token_num is None, the default sp_min_token_num is set (1000 for Dense models, 1 for MoE models).
VllmConfig._apply_optimization_level_defaults: for dense models, enable_sp is set to True.
VllmConfig.__post_init__: if sp_min_token_num is still None, enable_sp is set to False.
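The four steps can be traced with a short model of the flow. This is only a sketch of the steps as listed, not the actual vllm or vllm-ascend code. The function resolve_sp and its is_moe parameter are hypothetical, and it assumes that the optimization-level step fills enable_sp only when the user left it unset.

```python
def resolve_sp(enable_sp=None, sp_min_token_num=None, is_moe=False):
    """Trace the listed steps; returns (enable_sp, sp_min_token_num)."""
    # Step 1: both values start as None unless the user sets them.

    # Step 2 (platform defaults): pick a threshold only if SP was requested.
    if enable_sp is True and sp_min_token_num is None:
        sp_min_token_num = 1 if is_moe else 1000

    # Step 3 (optimization-level defaults): dense models get enable_sp = True.
    # Assumption: this fills only an unset value and never overrides the user.
    if enable_sp is None and not is_moe:
        enable_sp = True

    # Step 4 (post-init): without a threshold, SP is switched off again.
    if sp_min_token_num is None:
        enable_sp = False

    return bool(enable_sp), sp_min_token_num

not_requested = resolve_sp()                              # user sets nothing
dense_enabled = resolve_sp(enable_sp=True)                # dense model
moe_enabled = resolve_sp(enable_sp=True, is_moe=True)     # MoE model
custom = resolve_sp(enable_sp=True, sp_min_token_num=512)
```

When nothing is set, step 2 is skipped because enable_sp is still None, so the value assigned in step 3 is reverted in step 4. This matches the answer above that SP stays off unless it is explicitly requested.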