vllm_gaudi.ops.hpu_fused_moe
HPUUnquantizedFusedMoEMethod
Bases: UnquantizedFusedMoEMethod
MoE method without quantization.
Source code in vllm_gaudi/ops/hpu_fused_moe.py
__init__
forward_oot
process_weights_after_loading
process_weights_after_loading(layer: Module) -> None
get_compressed_expert_map
Compresses the expert map by removing any -1 entries.
This implementation uses a standard Python loop, which is compatible with
graph compilation modes that do not support dynamic shapes resulting from
operations like torch.where.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expert_map | Tensor | A tensor of shape (global_num_experts,) mapping a global expert index to its local index. Contains -1 for experts that are not assigned to the current rank. | required |
Returns:
| Type | Description |
|---|---|
| str | A string mapping from local to global index, ordered by global index (e.g., "0->5, 1->12, 2->23"). |
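The behaviour described above can be illustrated with a minimal sketch. This is not the code in vllm_gaudi/ops/hpu_fused_moe.py: the name compress_expert_map_sketch is hypothetical, and it takes a plain sequence of integers rather than a Tensor so that it runs without PyTorch (a tensor could be passed as expert_map.tolist()). It only shows the idea of walking the map with an ordinary loop and skipping -1 entries, so the result never depends on a data-dependent tensor shape.

```python
from typing import Sequence


def compress_expert_map_sketch(expert_map: Sequence[int]) -> str:
    """Hypothetical stand-in for get_compressed_expert_map.

    expert_map[g] is the local index of global expert g, or -1 if that
    expert is not assigned to the current rank.
    """
    pairs = []
    # Iterate in global-index order so the output is ordered by global index.
    for global_idx, local_idx in enumerate(expert_map):
        if local_idx == -1:
            continue  # expert lives on another rank; drop it
        pairs.append(f"{local_idx}->{global_idx}")
    return ", ".join(pairs)


# Eight global experts, three of which are local to this rank.
example_map = [-1, 0, -1, 1, -1, -1, 2, -1]
print(compress_expert_map_sketch(example_map))  # 0->1, 1->3, 2->6
```

Each pair reads local->global, matching the format of the documented example string.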
patched_fused_moe_forward
patched_fused_moe_forward(
    self, hidden_states: Tensor, router_logits: Tensor
) -> Union[Tensor, tuple[Tensor, Tensor]]
Patched forward method that bypasses the custom op to avoid recompilation issues.
patched_grouped_topk
patched_grouped_topk(
    hidden_states: Tensor,
    gating_output: Tensor,
    topk: int,
    renormalize: bool,
    num_expert_group: int = 0,
    topk_group: int = 0,
    scoring_func: str = "softmax",
    routed_scaling_factor: float = 1.0,
    e_score_correction_bias: Tensor | None = None,
) -> tuple[Tensor, Tensor]