vllm.model_executor.layers.fused_moe.deep_gemm_moe
DeepGemmExperts
Bases: FusedMoEPermuteExpertsUnpermute
Source code in vllm/model_executor/layers/fused_moe/deep_gemm_moe.py
__init__
apply
apply(
    hidden_states: Tensor,
    w1: Tensor,
    w2: Tensor,
    topk_ids: Tensor,
    activation: str,
    global_num_experts: int,
    expert_map: Optional[Tensor],
    w1_scale: Optional[Tensor],
    w2_scale: Optional[Tensor],
    w1_zp: Optional[Tensor],
    w2_zp: Optional[Tensor],
    a1q_scale: Optional[Tensor],
    a2_scale: Optional[Tensor],
    workspace13: Tensor,
    workspace2: Tensor,
    expert_num_tokens: Optional[Tensor],
) -> Tensor
workspace_shapes
workspace_shapes(
    a: Tensor,
    M: int,
    N: int,
    K: int,
    topk: int,
    num_experts: int,
) -> tuple[int, int, dtype]
_valid_deep_gemm
_valid_deep_gemm(
    hidden_states: Tensor,
    w1: Tensor,
    w2: Tensor,
    expert_map: Optional[Tensor] = None,
) -> bool
Check whether the given problem size is supported by the DeepGemm grouped
GEMM kernel. M, N, K, and the quantization block_shape must all be
aligned to the value returned by dg.get_m_alignment_for_contiguous_layout().
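As a rough illustration of the alignment rule described for _valid_deep_gemm, the check reduces to a divisibility test on each dimension. The sketch below is not the vLLM implementation: `is_aligned`, `shapes_supported`, and the alignment value of 128 are hypothetical stand-ins, and the real function may apply further checks that are not described here.

```python
from typing import Sequence

# Hypothetical alignment used only for this example; in vLLM the value
# comes from dg.get_m_alignment_for_contiguous_layout().
ASSUMED_ALIGNMENT = 128


def is_aligned(value: int, alignment: int) -> bool:
    """Return True if `value` is a multiple of `alignment`."""
    return value % alignment == 0


def shapes_supported(
    M: int,
    N: int,
    K: int,
    block_shape: Sequence[int],
    alignment: int = ASSUMED_ALIGNMENT,
) -> bool:
    """Check that M, N, K and every block_shape entry are aligned."""
    return all(is_aligned(dim, alignment) for dim in (M, N, K, *block_shape))


print(shapes_supported(256, 512, 1024, (128, 128)))  # True
print(shapes_supported(100, 512, 1024, (128, 128)))  # False: M is not aligned
```

A caller could use such a predicate to decide whether to take the DeepGemm path or fall back to another kernel.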
_valid_deep_gemm_shape
deep_gemm_block_shape (cached)
deep_gemm_moe_fp8
deep_gemm_moe_fp8(
    hidden_states: Tensor,
    w1: Tensor,
    w2: Tensor,
    w1_scale: Tensor,
    w2_scale: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    inplace: bool = False,
    activation: str = "silu",
    global_num_experts: int = -1,
    expert_map: Optional[Tensor] = None,
    a1_scale: Optional[Tensor] = None,
    a2_scale: Optional[Tensor] = None,
    apply_router_weight_on_input=False,
) -> Tensor
Computes an a8w8-quantized Mixture of Experts (MoE) layer using two sets of quantized weights, w1 and w2, and a top-k gating mechanism. The matrix multiplications are implemented with the DeepGemm grouped GEMM kernel.
Parameters:
- hidden_states (torch.Tensor): The input tensor to the MoE layer. Shape: [M, K]
- w1 (torch.Tensor): The first set of fp8 quantized expert weights. Shape: [num_experts, K, 2N] (the weights are passed transposed)
- w2 (torch.Tensor): The second set of fp8 quantized expert weights. Shape: [num_experts, N, K] (the weights are passed transposed)
- w1_scale (torch.Tensor): The fp32 scale used to dequantize w1. Shape: [num_experts] or [num_experts, 2N]
- w2_scale (torch.Tensor): The fp32 scale used to dequantize w2. Shape: [num_experts] or [num_experts, K]
- topk_weights (torch.Tensor): The weights of each token->expert mapping.
- topk_ids (torch.Tensor): The token->expert mapping for topk_weights.
- inplace (bool): If True, perform the operation in-place. Defaults to False.
- activation (str): The activation function to apply after the first MoE layer.
- global_num_experts (int): The total number of experts in the global expert space.
- expert_map (Optional[torch.Tensor]): A tensor mapping expert indices from the global expert space to the local expert space of the expert parallel shard.
- a1_scale (Optional[torch.Tensor]): The optional fp32 scale used to quantize the input activations (hidden_states). Shape: scalar or [M]
- a2_scale (Optional[torch.Tensor]): The optional fp32 scale used to quantize the intermediate result between the two GEMMs. Shape: scalar or [M]
Returns:
- torch.Tensor: The bfloat16 output tensor after applying the MoE layer.
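To make the documented shapes concrete, here is a minimal, unquantized reference of the computation in plain Python. It is a sketch under stated assumptions, not the DeepGemm path: the fp8 quantization and all scales are omitted, `moe_reference`, `matvec`, and `silu` are hypothetical helper names, the activation is assumed to be SiLU applied to the first N columns of the first GEMM output and multiplied by the last N columns, and `expert_map` is assumed to use -1 for experts that are not local to the shard.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x, w):
    """Multiply row vector x (length R) by matrix w (R rows, C columns)."""
    cols = len(w[0])
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(cols)]


def moe_reference(hidden_states, w1, w2, topk_weights, topk_ids, expert_map=None):
    """Unquantized top-k MoE reference.

    hidden_states: [M][K], w1: [E][K][2N], w2: [E][N][K],
    topk_weights / topk_ids: [M][topk]. Returns [M][K].
    """
    out = []
    for token, weights, ids in zip(hidden_states, topk_weights, topk_ids):
        acc = [0.0] * len(token)
        for weight, expert in zip(weights, ids):
            # Assumption: expert_map maps global ids to local ids, -1 if absent.
            local = expert if expert_map is None else expert_map[expert]
            if local < 0:
                continue  # expert lives on another shard
            h = matvec(token, w1[local])  # first GEMM, length 2N
            n = len(h) // 2
            act = [silu(g) * u for g, u in zip(h[:n], h[n:])]  # length N
            y = matvec(act, w2[local])  # second GEMM, length K
            acc = [a + weight * v for a, v in zip(acc, y)]
        out.append(acc)
    return out


# Tiny example: M=1, K=2, N=1, two global experts, only expert 0 is local.
w1 = [[[1.0, 2.0], [0.0, 0.0]]]  # [1][K=2][2N=2]
w2 = [[[1.0, 0.0]]]              # [1][N=1][K=2]
result = moe_reference(
    hidden_states=[[1.0, 0.0]],
    w1=w1,
    w2=w2,
    topk_weights=[[0.5, 0.5]],
    topk_ids=[[0, 1]],
    expert_map=[0, -1],
)
print(result)
```

In this example the token is routed to global experts 0 and 1, but only expert 0 contributes because expert 1 is mapped to -1. The contribution is 0.5 * (silu(1.0) * 2.0) in the first output column and 0 in the second.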