vllm.model_executor.layers.fused_moe.oracle
Modules:
- base – Abstract base class for MoE kernel oracles.
- fp8
- int8
- int_wna16
- mxfp4
- mxfp8
- nvfp4
- unquantized
- w4a8_int8
Classes:
- MoEKernelOracle – Abstract base for MoE kernel-selection oracles.
- UnquantizedMoEKernelOracle – Class-based view of the unquantized MoE kernel oracle.
MoEKernelOracle
Abstract base for MoE kernel-selection oracles.
Concrete oracles MUST implement: backend_enum_cls, get_priority_backends, backend_to_kernel_cls, map_backend, select_backend, make_kernel.
Concrete oracles MAY override: convert_to_kernel_format, make_quant_config. The base class provides defaults suitable for oracles that do not need these hooks: convert_to_kernel_format returns the weights unchanged, and make_quant_config raises, which is the intended behavior for the unquantized oracle.
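The MUST/MAY split maps naturally onto Python's `abc` machinery. The sketch below is a minimal illustration under stated assumptions, not vLLM's actual `MoEKernelOracle`: the class name `OracleContractSketch`, the `BackendT` type variable, the tuple return of `convert_to_kernel_format`, and the use of `NotImplementedError` are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any, Generic, Optional, Tuple, TypeVar

# Hypothetical type variable standing in for a per-oracle backend enum.
BackendT = TypeVar("BackendT", bound=Enum)


class OracleContractSketch(ABC, Generic[BackendT]):
    """Illustrative MUST/MAY split; a sketch, not vLLM's MoEKernelOracle."""

    # MUST-implement hooks stay abstract, so a subclass that omits any of
    # them cannot be instantiated (Python raises TypeError).
    @abstractmethod
    def backend_enum_cls(self) -> type: ...

    @abstractmethod
    def get_priority_backends(self, moe_config: Any) -> list: ...

    @abstractmethod
    def backend_to_kernel_cls(self, backend: BackendT) -> type: ...

    @abstractmethod
    def map_backend(self, runner_backend: Any) -> BackendT: ...

    @abstractmethod
    def select_backend(self, moe_config: Any, weight_key: Optional[Any] = None,
                       activation_key: Optional[Any] = None) -> BackendT: ...

    @abstractmethod
    def make_kernel(self, quant_config: Any, moe_config: Any, backend: BackendT,
                    experts_cls: type, routing_tables: Optional[Any] = None) -> Any: ...

    # MAY-override hooks get concrete defaults.
    def convert_to_kernel_format(self, backend: BackendT, moe_config: Any,
                                 w13_weight: Any, w2_weight: Any) -> Tuple[Any, Any]:
        # Assumed default: hand the weights back untouched.
        return w13_weight, w2_weight

    def make_quant_config(self, *args: Any, **kwargs: Any) -> Any:
        # Assumed exception type; the docs only say the default raises.
        raise NotImplementedError(
            f"{type(self).__name__} has no quantization config to build")
```

Leaving any MUST hook unimplemented makes instantiation fail with `TypeError`, so the contract is enforced when the oracle is constructed rather than at the first call into a missing method.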
Methods:
- backend_enum_cls – Return the concrete Enum class enumerating this oracle's backends.
- backend_to_kernel_cls – Map a backend enum value to its concrete FusedMoEExperts subclass.
- convert_to_kernel_format – Shuffle weights into the layout expected by backend.
- get_priority_backends – Return platform-appropriate backends in priority order for this moe_config.
- make_kernel – Construct the FusedMoEKernel (Prepare/Finalize + Experts combinator) for the chosen backend.
- make_quant_config – Build a FusedMoEQuantConfig for this oracle.
- map_backend – Map a user-facing MoEBackend (from the runner config) to this oracle's enum.
- select_backend – Primary entry point: choose the best supported backend for the given moe_config.
Source code in vllm/model_executor/layers/fused_moe/oracle/base.py
backend_enum_cls() abstractmethod
Return the concrete Enum class enumerating this oracle's backends (e.g. UnquantizedMoeBackend, Fp8MoeBackend).
backend_to_kernel_cls(backend) abstractmethod
Map a backend enum value to its concrete FusedMoEExperts subclass.
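Taken together, backend_enum_cls and backend_to_kernel_cls amount to "expose the enum, then look up one experts class per member". The sketch below assumes a dictionary-backed lookup; `ToyBackend`, its members, and the `*ExpertsSketch` classes are hypothetical stand-ins for `UnquantizedMoeBackend` and real `FusedMoEExperts` subclasses.

```python
from enum import Enum


class ToyBackend(Enum):
    # Hypothetical members; the real UnquantizedMoeBackend may differ.
    TRITON = "triton"
    AITER = "aiter"


class TritonExpertsSketch:
    """Stand-in for a FusedMoEExperts subclass (hypothetical)."""


class AiterExpertsSketch:
    """Stand-in for a FusedMoEExperts subclass (hypothetical)."""


# One entry per backend; backend_to_kernel_cls is a lookup into this table.
_EXPERTS_BY_BACKEND = {
    ToyBackend.TRITON: TritonExpertsSketch,
    ToyBackend.AITER: AiterExpertsSketch,
}


def backend_enum_cls() -> type:
    # The oracle exposes its enum class so callers can iterate or validate members.
    return ToyBackend


def backend_to_kernel_cls(backend: ToyBackend) -> type:
    try:
        return _EXPERTS_BY_BACKEND[backend]
    except KeyError:
        # Fail loudly if a member was added to the enum without a kernel class.
        raise ValueError(f"no experts class registered for {backend}") from None
```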
convert_to_kernel_format(backend, moe_config, w13_weight, w2_weight)
Shuffle weights into the layout expected by backend.
Default implementation returns the inputs unchanged. Oracles whose backends need weight permutation should override this (e.g. UnquantizedMoEKernelOracle handles AITER and FlashInfer layouts).
moe_config carries MoE-layer state (e.g. is_act_and_mul) that the conversion needs, without coupling the oracle to a Module reference. Quantized oracles whose conversion also needs scales, zero-points, or block shapes will override this method with a wider signature (and ultimately with a per-oracle config object, tracked in the #37753 follow-up PRs).
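The override pattern can be illustrated with a toy layout change. This is a minimal sketch, not the AITER or FlashInfer conversion: it assumes that `w13_weight` stacks a gate half above an up half when `is_act_and_mul` is true, and that a hypothetical `UP_FIRST` backend wants the halves swapped. Weights are plain nested lists so the example runs without a tensor library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

Matrix = List[List[float]]  # toy stand-in for a weight tensor


class ToyBackend(Enum):
    DEFAULT = "default"
    UP_FIRST = "up_first"  # hypothetical backend wanting [up; gate] row order


@dataclass
class ToyMoEConfig:
    is_act_and_mul: bool  # assumed: True when w13 stacks gate and up projections


class BaseOracleSketch:
    def convert_to_kernel_format(self, backend, moe_config, w13_weight: Matrix,
                                 w2_weight: Matrix) -> Tuple[Matrix, Matrix]:
        # Default: return the inputs unchanged.
        return w13_weight, w2_weight


class UpFirstOracleSketch(BaseOracleSketch):
    def convert_to_kernel_format(self, backend, moe_config, w13_weight, w2_weight):
        # Only the UP_FIRST backend with a gated activation needs a new layout.
        if backend is not ToyBackend.UP_FIRST or not moe_config.is_act_and_mul:
            return super().convert_to_kernel_format(
                backend, moe_config, w13_weight, w2_weight)
        half = len(w13_weight) // 2
        gate_rows, up_rows = w13_weight[:half], w13_weight[half:]
        # Swap the two halves; w2 is untouched in this toy layout.
        return up_rows + gate_rows, w2_weight
```

Reading is_act_and_mul from moe_config, rather than from a layer object, mirrors the decoupling described above: the conversion sees only the state it needs.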
get_priority_backends(moe_config) abstractmethod
Return platform-appropriate backends in priority order for this moe_config.
make_kernel(quant_config, moe_config, backend, experts_cls, routing_tables=None) abstractmethod
Construct the FusedMoEKernel (Prepare/Finalize + Experts combinator) for the chosen backend.
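To make the "Prepare/Finalize + Experts combinator" idea concrete, here is a toy pipeline. It is only a sketch of the composition pattern: the stage method names `prepare`, `apply`, and `finalize`, and all classes shown, are hypothetical and do not reproduce the real `FusedMoEKernel` interface. A real oracle would also thread `quant_config`, `moe_config`, `backend`, and `routing_tables` through construction.

```python
from dataclasses import dataclass
from typing import List


class IdentityPrepareFinalize:
    """Hypothetical no-op stage; a real one might quantize or dispatch tokens."""

    def prepare(self, tokens: List[float]) -> List[float]:
        return tokens

    def finalize(self, expert_out: List[float]) -> List[float]:
        return expert_out


class DoubleExperts:
    """Hypothetical experts stage standing in for a FusedMoEExperts subclass."""

    def apply(self, tokens: List[float]) -> List[float]:
        return [2.0 * t for t in tokens]


@dataclass
class KernelSketch:
    """Combinator: prepare -> experts -> finalize."""
    prepare_finalize: IdentityPrepareFinalize
    experts: DoubleExperts

    def __call__(self, tokens: List[float]) -> List[float]:
        staged = self.prepare_finalize.prepare(tokens)
        out = self.experts.apply(staged)
        return self.prepare_finalize.finalize(out)


def make_kernel_sketch(experts_cls: type = DoubleExperts) -> KernelSketch:
    # The oracle picks experts_cls (e.g. via backend_to_kernel_cls) and pairs it
    # with a matching prepare/finalize stage.
    return KernelSketch(IdentityPrepareFinalize(), experts_cls())
```

Keeping the two halves separate lets the same experts class be paired with different prepare/finalize stages without changing either one.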
make_quant_config(*args, **kwargs)
Build a FusedMoEQuantConfig for this oracle.
Quantized oracles (fp8, nvfp4, mxfp4, ...) override this with the appropriate signature for their quantization scheme. Unquantized oracles inherit the default, which raises because there is no quantization-specific config to build.
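The inherit-or-override behavior of make_quant_config can be sketched as follows. `ToyQuantConfig`, its fields, the `block_shape` parameter, and the choice of `NotImplementedError` are assumptions for illustration; the real `FusedMoEQuantConfig` and the exception the base class raises may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyQuantConfig:
    # Hypothetical fields; the real FusedMoEQuantConfig is richer.
    weight_dtype: str
    block_shape: Optional[Tuple[int, int]] = None


class BaseOracleSketch:
    def make_quant_config(self, *args, **kwargs):
        # Default: nothing quantization-specific to build.
        raise NotImplementedError(
            f"{type(self).__name__} does not build a quant config")


class UnquantizedSketch(BaseOracleSketch):
    pass  # inherits the raising default


class Fp8Sketch(BaseOracleSketch):
    def make_quant_config(self, block_shape: Optional[Tuple[int, int]] = None):
        # Scheme-specific signature in the override.
        return ToyQuantConfig(weight_dtype="fp8", block_shape=block_shape)
```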
map_backend(runner_backend) abstractmethod
Map a user-facing MoEBackend (from the runner config) to this oracle's enum.
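A table-driven map_backend is one straightforward way to translate a user-facing choice into an oracle-specific enum. In the sketch below, `RunnerBackend` stands in for `MoEBackend`, and both enums and their members are hypothetical. Rejecting unsupported choices with `ValueError` is an assumption about error handling, not a documented behavior.

```python
from enum import Enum


class RunnerBackend(Enum):
    # Hypothetical user-facing choices (stand-in for MoEBackend).
    TRITON = "triton"
    FLASHINFER = "flashinfer"
    OTHER = "other"  # a choice this toy oracle cannot serve


class ToyUnquantizedBackend(Enum):
    # Hypothetical oracle-specific members.
    TRITON = "triton"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"


_RUNNER_TO_ORACLE = {
    RunnerBackend.TRITON: ToyUnquantizedBackend.TRITON,
    RunnerBackend.FLASHINFER: ToyUnquantizedBackend.FLASHINFER_CUTLASS,
}


def map_backend(runner_backend: RunnerBackend) -> ToyUnquantizedBackend:
    # One user-facing name may correspond to a more specific oracle backend.
    if runner_backend not in _RUNNER_TO_ORACLE:
        raise ValueError(f"{runner_backend.value!r} is not supported by this oracle")
    return _RUNNER_TO_ORACLE[runner_backend]
```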
select_backend(moe_config, weight_key=None, activation_key=None) abstractmethod
Primary entry point: choose the best supported backend for the given moe_config.
weight_key / activation_key carry the quantization scheme of the weights and activations and are consumed by quantized oracles (fp8, nvfp4, int8, ...) to disambiguate backends. The unquantized oracle ignores them. Subclasses with additional selection inputs (e.g. int_wna16 needs weight_bits, fp8 needs allow_vllm_cutlass) widen the signature in their override; a per-oracle config object is the longer-term target tracked in the #37753 follow-up PRs.
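One plausible shape for select_backend is to walk the list from get_priority_backends and return the first backend whose constraints are met. The sketch below assumes exactly that strategy. The device field, the `hidden_size % 128` rule, and the `is_supported` helper are hypothetical. As in the unquantized case, `weight_key` and `activation_key` are accepted but ignored.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List, Optional


class ToyBackend(Enum):
    FAST = "fast"
    PORTABLE = "portable"


@dataclass
class ToyMoEConfig:
    device: str       # hypothetical, e.g. "gpu" or "cpu"
    hidden_size: int


def get_priority_backends(moe_config: ToyMoEConfig) -> List[ToyBackend]:
    # Platform-appropriate ordering, preferred backend first.
    if moe_config.device == "gpu":
        return [ToyBackend.FAST, ToyBackend.PORTABLE]
    return [ToyBackend.PORTABLE]


def is_supported(backend: ToyBackend, moe_config: ToyMoEConfig) -> bool:
    # Hypothetical constraint: FAST needs hidden_size divisible by 128.
    if backend is ToyBackend.FAST:
        return moe_config.hidden_size % 128 == 0
    return True


def select_backend(moe_config: ToyMoEConfig, weight_key: Optional[Any] = None,
                   activation_key: Optional[Any] = None) -> ToyBackend:
    # weight_key / activation_key are ignored here, as in the unquantized oracle;
    # a quantized oracle would pass them into its support checks.
    for backend in get_priority_backends(moe_config):
        if is_supported(backend, moe_config):
            return backend
    raise RuntimeError("no supported MoE backend for this config")
```

Keeping ordering (get_priority_backends) separate from feasibility (the support check) means a platform can reorder preferences without touching the constraints of any individual backend.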
UnquantizedMoEKernelOracle
Bases: MoEKernelOracle[UnquantizedMoeBackend]
Class-based view of the unquantized MoE kernel oracle.
Each method delegates to its module-level counterpart so that instantiating and calling this class is bit-identical to calling the standalone functions. Follow-up PRs may move logic from the module-level functions into these methods.
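The delegation pattern is what makes the class view behave identically to the standalone functions: each method only forwards its arguments. Below is a minimal sketch of that pattern. The module-level function names `select_unquantized_backend` and `convert_unquantized_weights` are hypothetical placeholders, not the actual functions in the `unquantized` module.

```python
from enum import Enum


class ToyBackend(Enum):
    TRITON = "triton"


# Hypothetical module-level functions (the pre-existing functional API).
def select_unquantized_backend(moe_config) -> ToyBackend:
    return ToyBackend.TRITON


def convert_unquantized_weights(backend, moe_config, w13_weight, w2_weight):
    return w13_weight, w2_weight


class UnquantizedOracleSketch:
    """Thin class view: every method forwards to its module-level counterpart."""

    def select_backend(self, moe_config, weight_key=None, activation_key=None):
        # weight_key / activation_key are accepted for interface parity but ignored.
        return select_unquantized_backend(moe_config)

    def convert_to_kernel_format(self, backend, moe_config, w13_weight, w2_weight):
        # Pure forwarding: same inputs in, the very same objects out.
        return convert_unquantized_weights(backend, moe_config, w13_weight, w2_weight)
```

Because the methods add no logic of their own, moving logic from the functions into the methods later can be done one method at a time while callers of either form keep getting the same results.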