Npugraph_ex

How it works

npugraph_ex is an optimization based on the FX graph and can be viewed as an acceleration scheme for aclgraph mode.

The code is available in the torchair source repository.

Default FX graph optimizations

FX graph optimization

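For intermediate nodes of the model, out-of-place operators in a node are replaced with in-place operators. This reduces memory movement during computation and improves performance.
For the model's original input parameters, if they are used by in-place operators, Dynamo's Functionalize pass replaces each in-place operator with an out-of-place operator followed by a copy operator. npugraph_ex reverses this step, restoring the in-place operator and reducing memory movement.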
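The reversal described above can be pictured with a toy model. The sketch below is not npugraph_ex code and does not use torch.fx: the `Op` record, the buffer names, and the `reinplace` function are all hypothetical, and the `copy_` operator is simplified to "write `args[0]` into `out`". It only shows the shape of the rewrite, in which an out-of-place operator followed by a copy back into its first input collapses into one in-place operator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical, simplified stand-in for a graph node."""
    name: str              # operator name, e.g. "add" or "copy_"
    out: str               # buffer the operator writes
    args: Tuple[str, ...]  # buffers the operator reads


def reinplace(ops: List[Op]) -> List[Op]:
    """Collapse `tmp = op(x, ...); x = copy_(tmp)` into `x = op_(x, ...)`."""
    result: List[Op] = []
    i = 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        # Number of operators that read the temporary produced by `cur`.
        readers = sum(cur.out in other.args for other in ops)
        if (
            nxt is not None
            and nxt.name == "copy_"
            and nxt.args == (cur.out,)   # the copy reads the temporary
            and cur.args
            and nxt.out == cur.args[0]   # and writes back into the first input
            and readers == 1             # nothing else needs the temporary
        ):
            result.append(Op(cur.name + "_", nxt.out, cur.args))
            i += 2
        else:
            result.append(cur)
            i += 1
    return result


# Functionalized form: out-of-place add, then a copy back into the input x.
functionalized = [
    Op("add", "tmp", ("x", "y")),
    Op("copy_", "x", ("tmp",)),
    Op("mul", "out", ("x", "z")),
]
rewritten = reinplace(functionalized)
for op in rewritten:
    print(op)
```

The `readers == 1` guard reflects the usual safety condition for this kind of rewrite: if another operator still reads the temporary, dropping it would change the result, so the pair is left alone.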

FX fusion optimization

npugraph_ex currently provides a number of operator fusion optimizations, and more will be added in the future.

An operator combination that satisfies a replacement rule can be replaced with the corresponding fused operator.

The list of default fusion optimizations is available.
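As a rough illustration of rule-based replacement, the sketch below fuses chains of operator names in a flat list. It is a toy under stated assumptions: the `FUSION_RULES` table and the `fuse` function are hypothetical, real matching works on the dataflow of the FX graph rather than on a list of names, and the single rule shown simply reuses the add plus npu_rms_norm pairing from the usage example in this document. It is not a statement about what the default list contains.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: a chain of operator names -> the fused operator name.
FUSION_RULES: Dict[Tuple[str, ...], str] = {
    ("add", "npu_rms_norm"): "npu_add_rms_norm",
}


def fuse(op_names: List[str], rules: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace every chain of operators that matches a rule with its fused operator."""
    fused: List[str] = []
    i = 0
    while i < len(op_names):
        for pattern, replacement in rules.items():
            if tuple(op_names[i:i + len(pattern)]) == pattern:
                fused.append(replacement)
                i += len(pattern)
                break
        else:
            # No rule matched at this position: keep the operator unchanged.
            fused.append(op_names[i])
            i += 1
    return fused


ops = ["matmul", "add", "npu_rms_norm", "matmul"]
print(fuse(ops, FUSION_RULES))  # ['matmul', 'npu_add_rms_norm', 'matmul']
```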

Custom fusion optimizations

Users can register custom graph fusion optimizations in npugraph_ex to modify the PyTorch FX graph. Registration relies on the register_replacement API.

The declaration of this API and a usage example follow.

register_replacement(search_fn, replace_fn, example_inputs, trace_fn=fwd_only, extra_check=_return_true, search_fn_pattern=None)

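Parameter	Input/Output	Description	Required
search_fn	Input	The operator combination or computation logic to identify in the FX graph, for example a combination of operators to be fused	Yes
replace_fn	Input	When a combination matching search_fn is found in the target graph, the computation logic of this function replaces the original subgraph, achieving operator fusion or optimization	Yes
example_inputs	Input	Example input tensors used to trace search_fn and replace_fn. Their shapes and dtypes should match the actual scenario	Yes
trace_fn	Input	By default only the forward computation graph is traced, which suits inference-time optimization. To support training scenarios, supply a function that also supports backward tracing	No
extra_check	Input	Extra validation function for a candidate fusion. Its input parameter must be a Match object from torch._inductor.pattern_matcher. It performs further custom checks on the match result, such as checking whether the fused operators are on the same stream, checking the device type, or checking input shapes	No
search_fn_pattern	Input	A custom pattern object, which usually does not need to be supplied. Its definition follows the rules of the native PyTorch MultiOutputPattern object. When this parameter is passed, search_fn is no longer used to match operator combinations, and this parameter is used directly as the matching rule	No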

Usage example

import functools
import torch, torch_npu, npugraph_ex

from torch._inductor.pattern_matcher import Match
from torch._subclasses.fake_tensor import FakeTensorMode
from npugraph_ex.core.utils import logger

# Assume fusing the add operator and the npu_rms_norm operator into the npu_add_rms_norm operator
# Define a search_fn to find the operator combinations in the original FX graph before fusion.
def search_fn(x1, x2, gamma):
    xOut = torch.add(x1, x2)
    y, _ = torch_npu.npu_rms_norm(xOut, gamma)
    return y, xOut

# Define a replace_fn, that is, a fusion operator, used to replace operator combinations in the FX graph
def replace_fn(x1, x2, gamma):
    y, _, xOut = torch_npu.npu_add_rms_norm(
        x1, x2, gamma
    )
    return y, xOut

# extra_check carries additional validation logic. Here it checks whether the
# last dimension of the first input parameter x1 equals a specific value;
# if it does not, fusion is not allowed.
def extra_check(match: Match):
    x1 = match.kwargs.get("x1")

    if x1 is None:
        return False 
    if not hasattr(x1, "meta") or "val" not in x1.meta:
        return False

    a_shape = x1.meta["val"].shape
    return a_shape[-1] == 7168 

# Define some sample inputs to trace search_fn and replace_fn into an FX graph
fake_mode = FakeTensorMode()
with fake_mode:
    # sizes/values don't actually matter for initial trace
    # once we get a possible match we re-trace with the actual values and verify the match still holds
    input_tensor = functools.partial(torch.empty, (1, 1, 2), device="npu", dtype=torch.float16)
    kwargs_tensor = functools.partial(torch.empty, 2, device="npu", dtype=torch.float16)

    # Call the npugraph_ex.register_replacement API with search_fn, replace_fn,
    # and example_inputs. Additional validation can be passed in as extra_check.
    npugraph_ex.register_replacement(
        search_fn=search_fn,
        replace_fn=replace_fn,
        example_inputs=(input_tensor(), input_tensor(), kwargs_tensor()),
        extra_check=extra_check
    )

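The default fusion optimizations in npugraph_ex are also implemented with this API. More examples of its use can be found in the vllm-ascend and npugraph_ex code repositories.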
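When several fusions need the same kind of shape guard, the extra_check logic from the usage example can be factored into a small helper. The sketch below is an assumption-laden illustration: `make_last_dim_check` is a hypothetical helper and not part of npugraph_ex, and the demo objects are `SimpleNamespace` stand-ins that only mimic the two attributes the example reads (`match.kwargs` and `node.meta["val"].shape`), so the block runs without torch or NPU hardware.

```python
from types import SimpleNamespace
from typing import Any, Callable


def make_last_dim_check(arg_name: str, expected: int) -> Callable[[Any], bool]:
    """Build an extra_check-style predicate (hypothetical helper).

    The returned function accepts a match only if the argument named
    `arg_name` carries shape metadata whose last dimension equals `expected`.
    """
    def check(match: Any) -> bool:
        node = match.kwargs.get(arg_name)
        if node is None:
            return False
        val = getattr(node, "meta", {}).get("val")
        if val is None or len(val.shape) == 0:
            return False
        return val.shape[-1] == expected

    return check


# Stand-ins for a Match object and an FX node, for demonstration only.
fake_node = SimpleNamespace(meta={"val": SimpleNamespace(shape=(1, 1, 7168))})
fake_match = SimpleNamespace(kwargs={"x1": fake_node})

check_x1 = make_last_dim_check("x1", 7168)
print(check_x1(fake_match))  # True
```

Under these assumptions, the resulting predicate could be passed as `extra_check` in the same way as the hand-written function in the usage example.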

DFX

npugraph_ex reuses the PyTorch community's TORCH_COMPILE_DEBUG environment variable: when TORCH_COMPILE_DEBUG=1 is set, the FX graphs from the whole process are output.
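One way to enable the variable for a single run is to launch the workload in a child process with an adjusted environment, so the setting is in place before torch is imported. The sketch below assumes a hypothetical script name, `run_model.py`, and the `debug_env` helper is illustrative rather than part of any library.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with TORCH_COMPILE_DEBUG enabled."""
    env = dict(os.environ if base is None else base)
    env["TORCH_COMPILE_DEBUG"] = "1"
    return env


def run_with_debug(script: str) -> int:
    """Run `script` with the current interpreter and the debug variable set."""
    completed = subprocess.run([sys.executable, script], env=debug_env())
    return completed.returncode


# Example call, with a hypothetical script name:
# run_with_debug("run_model.py")
```

Setting the variable in the parent shell before starting Python has the same effect. The child-process form only keeps the setting scoped to one run.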