vllm.model_executor.model_loader.ep_weight_filter ¶
Filter out non-local expert weights during loading to avoid redundant I/O.
In DP+EP deployments each rank only needs its own expert shard. Skipping non-local expert tensors before they are read from disk eliminates the majority of storage I/O for MoE models (experts typically account for ~85-90 % of total weight bytes).
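As a rough back-of-the-envelope illustration of the saving, the sketch below estimates the fraction of checkpoint bytes a single rank still reads. It assumes experts are split evenly across ranks and that every rank reads all dense weights in full. The helper name `fraction_of_bytes_read` is hypothetical and not part of vLLM.

```python
def fraction_of_bytes_read(expert_fraction: float, ep_size: int) -> float:
    """Estimate the share of checkpoint bytes one rank reads with filtering on.

    Simplifying assumptions (not taken from vLLM): experts are divided evenly
    across ranks, and dense weights are read in full by every rank.
    """
    dense = 1.0 - expert_fraction              # attention, norms, embeddings, ...
    local_experts = expert_fraction / ep_size  # this rank's share of the experts
    return dense + local_experts


if __name__ == "__main__":
    # 0.9 is the upper end of the ~85-90 % range quoted above.
    for ep_size in (2, 8, 32):
        frac = fraction_of_bytes_read(0.9, ep_size)
        print(f"ep_size={ep_size}: reads {frac:.1%} of the checkpoint")
```

Under these assumptions the saving grows with `ep_size` but is bounded by the dense share, which every rank still has to read.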
Functions:
- compute_local_expert_ids – Compute the set of global expert ids owned by ep_rank.
- parse_expert_id – Return the expert id embedded in weight_name, or None if it is not a per-expert weight.
- should_skip_weight – Return True if weight_name is an expert weight that does not belong to the local rank.
compute_local_expert_ids(num_experts, ep_size, ep_rank, placement='linear') ¶
Compute the set of global expert ids owned by ep_rank.
Returns None when EP is not active (ep_size <= 1), meaning all experts are local and no filtering should be performed.
The distribution logic mirrors vllm.model_executor.layers.fused_moe.layer.determine_expert_map.
Parameters:
- placement (str, default: 'linear') – "linear" for contiguous assignment, "round_robin" for interleaved assignment.
Source code in vllm/model_executor/model_loader/ep_weight_filter.py
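The two placement modes can be illustrated with a minimal sketch. This is not the vLLM source: the function name `sketch_local_expert_ids` is hypothetical, and the handling of a non-divisible expert count (giving the first `num_experts % ep_size` ranks one extra expert) is an assumption about how the remainder is distributed.

```python
from typing import Optional, Set


def sketch_local_expert_ids(
    num_experts: int, ep_size: int, ep_rank: int, placement: str = "linear"
) -> Optional[Set[int]]:
    """Illustrative stand-in for compute_local_expert_ids (not the real code)."""
    if ep_size <= 1:
        # EP inactive: every expert is local, so signal "no filtering".
        return None
    if placement == "linear":
        # Contiguous blocks. Assumption: the first `remainder` ranks each
        # take one extra expert when num_experts is not divisible by ep_size.
        base, remainder = divmod(num_experts, ep_size)
        start = ep_rank * base + min(ep_rank, remainder)
        count = base + (1 if ep_rank < remainder else 0)
        return set(range(start, start + count))
    if placement == "round_robin":
        # Interleaved: rank r owns experts r, r + ep_size, r + 2 * ep_size, ...
        return set(range(ep_rank, num_experts, ep_size))
    raise ValueError(f"unknown placement: {placement!r}")


if __name__ == "__main__":
    print(sorted(sketch_local_expert_ids(8, 4, 1, "linear")))       # [2, 3]
    print(sorted(sketch_local_expert_ids(8, 4, 1, "round_robin")))  # [1, 5]
```

In both modes the per-rank sets are disjoint and together cover every expert id, which is the property the weight filter depends on.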
parse_expert_id(weight_name) ¶
Return the expert id embedded in weight_name, or None if it is not a per-expert weight.
Returns None for dense weights (attention, layernorm, embedding), shared experts, and 3D fused-expert tensors where all experts are stored in a single tensor without a numeric expert id in the name.
Source code in vllm/model_executor/model_loader/ep_weight_filter.py
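The behaviour described above can be approximated with a single regular expression. The sketch below assumes per-expert weights contain a `.experts.<id>.` path segment, which is a common checkpoint naming convention. The actual pattern used by vLLM may differ, and both `sketch_parse_expert_id` and the example weight names are hypothetical.

```python
import re
from typing import Optional

# Assumption: a per-expert weight name contains ".experts.<digits>.".
_EXPERT_ID_RE = re.compile(r"\.experts\.(\d+)\.")


def sketch_parse_expert_id(weight_name: str) -> Optional[int]:
    """Illustrative stand-in for parse_expert_id (not the real code)."""
    match = _EXPERT_ID_RE.search(weight_name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    examples = [
        "model.layers.3.mlp.experts.17.gate_proj.weight",     # per-expert
        "model.layers.3.self_attn.q_proj.weight",             # dense
        "model.layers.3.mlp.shared_experts.gate_proj.weight", # shared expert
        "model.layers.3.mlp.experts.gate_up_proj",            # 3D fused tensor
    ]
    for name in examples:
        print(name, "->", sketch_parse_expert_id(name))
```

Only the first name yields an id. The shared-expert and fused-tensor names do not match because they lack a numeric segment directly after `.experts.`.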
should_skip_weight(weight_name, local_expert_ids) ¶
Return True if weight_name is an expert weight that does not belong to the local rank and should be skipped during loading.
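A minimal sketch of how this predicate could be applied while iterating over checkpoint names follows. It is self-contained and does not import vLLM: `sketch_should_skip`, the regex, and the weight names are all hypothetical, and treating `local_expert_ids=None` as "skip nothing" is an assumption based on the statement that None means no filtering should be performed.

```python
import re
from typing import Optional, Set

# Assumption: a per-expert weight name contains ".experts.<digits>.".
_EXPERT_ID_RE = re.compile(r"\.experts\.(\d+)\.")


def sketch_should_skip(weight_name: str, local_expert_ids: Optional[Set[int]]) -> bool:
    """Illustrative stand-in for should_skip_weight (not the real code)."""
    if local_expert_ids is None:
        # EP inactive: keep everything.
        return False
    match = _EXPERT_ID_RE.search(weight_name)
    if match is None:
        # Dense, shared-expert, or fused-expert tensor: always load.
        return False
    return int(match.group(1)) not in local_expert_ids


if __name__ == "__main__":
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.mlp.experts.1.up_proj.weight",
        "model.layers.0.mlp.experts.2.up_proj.weight",
        "model.layers.0.mlp.experts.3.up_proj.weight",
    ]
    local_ids = {2, 3}  # e.g. rank 1 of 4 with 8 experts, linear placement
    kept = [n for n in names if not sketch_should_skip(n, local_ids)]
    print(kept)
```

The check uses only the tensor name, so it can run before any tensor data is read, which is where the I/O saving comes from.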