vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe
TritonOrDeepGemmExperts
Bases: FusedMoEPermuteExpertsUnpermute
Source code in vllm/model_executor/layers/fused_moe/triton_deep_gemm_moe.py
triton_expert
instance-attribute
triton_expert = TritonExperts(
    use_fp8_w8a8=use_fp8_w8a8,
    use_int8_w8a8=use_int8_w8a8,
    use_int4_w4a16=use_int4_w4a16,
    use_int8_w8a16=use_int8_w8a16,
    per_channel_quant=per_channel_quant,
    block_shape=block_shape,
    block_m=block_m,
)
__init__
__init__(
    use_fp8_w8a8: bool = False,
    use_int8_w8a8: bool = False,
    use_int8_w8a16: bool = False,
    use_int4_w4a16: bool = False,
    per_channel_quant: bool = False,
    block_shape: Optional[list[int]] = None,
    block_m: Optional[int] = None,
    allow_deep_gemm: bool = False,
)
apply
apply(
    hidden_states: Tensor,
    w1: Tensor,
    w2: Tensor,
    topk_ids: Tensor,
    activation: str,
    global_num_experts: int,
    expert_map: Optional[Tensor],
    w1_scale: Optional[Tensor],
    w2_scale: Optional[Tensor],
    w1_zp: Optional[Tensor],
    w2_zp: Optional[Tensor],
    a1q_scale: Optional[Tensor],
    a2_scale: Optional[Tensor],
    workspace13: Tensor,
    workspace2: Tensor,
    expert_num_tokens: Optional[Tensor],
) -> Tensor
workspace_shapes
workspace_shapes(
    a: Tensor,
    M: int,
    N: int,
    K: int,
    topk: int,
    num_experts: int,
) -> tuple[int, int, dtype]
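
The class name and the `allow_deep_gemm` constructor flag suggest a wrapper that chooses between a Triton path and a DeepGEMM path at call time. The page does not state the selection rule, so the following is only a minimal sketch of that dispatch pattern. The stub backends, the `ASSUMED_ALIGNMENT` constant, the `_deep_gemm_eligible` helper, and the simplified `apply(M, N, K)` signature are all hypothetical and are not vLLM's implementation.

```python
# Illustrative sketch only: stand-in backends, not vLLM's real expert classes.
ASSUMED_ALIGNMENT = 128  # hypothetical shape requirement for the specialised path


class StubTritonBackend:
    """Hypothetical general-purpose backend that accepts any shape."""

    def apply(self, M: int, N: int, K: int) -> str:
        return "triton"


class StubDeepGemmBackend:
    """Hypothetical specialised backend with stricter requirements."""

    def apply(self, M: int, N: int, K: int) -> str:
        return "deep_gemm"


class TritonOrDeepGemmSketch:
    """Holds both backends and picks one per call."""

    def __init__(self, use_fp8_w8a8: bool = False, allow_deep_gemm: bool = False):
        self.triton_expert = StubTritonBackend()
        self.deep_gemm_expert = StubDeepGemmBackend()
        self.use_fp8_w8a8 = use_fp8_w8a8
        self.allow_deep_gemm = allow_deep_gemm

    def _deep_gemm_eligible(self, N: int, K: int) -> bool:
        # Assumed rule: the caller opted in, FP8 is enabled, and both
        # dimensions are multiples of the assumed alignment.
        shapes_ok = N % ASSUMED_ALIGNMENT == 0 and K % ASSUMED_ALIGNMENT == 0
        return self.allow_deep_gemm and self.use_fp8_w8a8 and shapes_ok

    def apply(self, M: int, N: int, K: int) -> str:
        # Fall back to the general backend whenever the check fails.
        if self._deep_gemm_eligible(N, K):
            return self.deep_gemm_expert.apply(M, N, K)
        return self.triton_expert.apply(M, N, K)
```

In this sketch the default constructor arguments always select the Triton stub, which mirrors the `allow_deep_gemm: bool = False` default in the signature above. Whether the real class uses the same conditions should be checked against the source file.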