Graph Mode Guide#

Overview#

This guide explains how graph mode is used in vLLM Ascend.

vLLM already provides the generic graph-mode architecture, mode definitions, and compile integration. For those upstream concepts, see:

This document focuses on the Ascend-specific view: how graph mode works on Ascend, which components are involved, how to configure them, and what constraints users should keep in mind.

Current Status on Ascend#

  • Graph mode is currently available only on the V1 Engine.

  • ACLGraph (capture/replay via torch.npu.NPUGraph) is the runtime graph execution mechanism used by the default graph path on Ascend.

  • Npugraph_ex is a compile-time FX graph optimization layer, enabled by default in FULL/FULL_DECODE_ONLY modes. It optimizes the graph before ACLGraph captures it.

  • XliteGraph is an optional graph path for selected model families and environments.

  • In context parallel scenarios, cudagraph_mode="FULL" is not sufficiently supported yet.

Graph Paths on Ascend#

vLLM Ascend provides two graph paths:

Graph Path

Default

Description

Since

ACLGraph (+ Npugraph_ex)

Yes

Compile-time FX optimization (Npugraph_ex) + runtime capture/replay (ACLGraph)

v0.9.0rc1 (Npugraph_ex since v0.15.0rc1)

XliteGraph

No

Preconfigured graph path for selected model families. Requires separate installation

v0.11.0

How Graph Mode Works on Ascend#

The default graph path on Ascend involves two stages: compile-time optimization and runtime capture/replay. ACLGraph handles the runtime capture/replay. The compile-time stage differs by cudagraph_mode:

  • FULL_AND_PIECEWISE: Default mode, same as the upstream vLLM strategy. The compile-time path follows PIECEWISE compilation, while the runtime may still use full-graph behavior for uniform decode batches.

  • FULL / FULL_DECODE_ONLY: Npugraph_ex optimizes the FX graph via npugraph_ex (force_eager=True, compile-time only, no capture). The optimized callable is then captured and replayed by ACLGraph at runtime.

  • PIECEWISE: Npugraph_ex is disabled. Only basic FX fusion passes are applied at compile-time. ACLGraph captures and replays the resulting callable at runtime.

  • NONE: No compilation or graph capture. The model runs in eager mode.

cudagraph_mode

Compile-time

Runtime

Npugraph_ex

FULL_AND_PIECEWISE

Piecewise compilation path

Mixed: PIECEWISE for mixed batches, FULL-capable for uniform decode batches

Disabled

FULL / FULL_DECODE_ONLY

Npugraph_ex FX optimization

ACLGraph capture/replay

Enabled

PIECEWISE

Fusion pass only

ACLGraph capture/replay

Disabled

NONE

None

Eager execution

Disabled

Additionally, XliteGraph is available as an optional alternative graph path for selected model families (see Using XliteGraph).

Using ACLGraph#

ACLGraph is the runtime graph capture/replay mechanism on Ascend. It is enabled automatically when graph mode is active (i.e., cudagraph_mode is not NONE), and does not require explicit configuration.

Basic usage#

Offline example:

from vllm import LLM

llm = LLM(model="path/to/Qwen3-0.6B")
outputs = llm.generate("Hello, how are you?")

Online example:

vllm serve Qwen/Qwen3-0.6B

Explicit cudagraph_mode configuration#

The generic cudagraph_mode options come from upstream vLLM. On Ascend, the final effective mode may still be adjusted according to platform and backend support, so the official vLLM CUDA Graphs document remains the canonical reference for mode semantics.

CLI example:

vllm serve Qwen/Qwen3-0.6B \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}'

Python example:

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)

For the detailed meaning of NONE, PIECEWISE, FULL, FULL_DECODE_ONLY, and FULL_AND_PIECEWISE, as well as the generic fallback policy, see the upstream CUDA Graphs design doc.

Attention backend compatibility#

Not all attention backends support all graph modes. vLLM checks attention backend compatibility during compatibility checks and, when possible, automatically adjusts cudagraph_mode to a more compatible mode instead of failing immediately. In practice, this means a requested full-graph mode may be narrowed to a mixed or piecewise mode, and if the backend cannot support graph execution at all, graph mode may be disabled.

On Ascend, the current attention backend support levels are:

Attention backend

Declared support

Practical meaning

attention_v1

ALWAYS

Supports graph execution for mixed prefill/decode batches

context_parallel/attention_cp

ALWAYS

Supports graph execution for mixed prefill/decode batches

mla_v1

UNIFORM_BATCH

Graph execution is limited to uniform batches; full graph is more restricted

context_parallel/mla_cp

UNIFORM_BATCH

Graph execution is limited to uniform batches; full graph is more restricted

sfa_v1

UNIFORM_BATCH

Graph execution is limited to uniform batches; full graph is more restricted

context_parallel/sfa_cp

UNIFORM_BATCH

Graph execution is limited to uniform batches; full graph is more restricted

This is why the effective graph mode on Ascend may differ from the mode requested in configuration.

Troubleshooting capture resource exhaustion#

If ACLGraph capture fails because the configured graph sizes exceed the runtime resources available on the current stack, vLLM Ascend now raises a dedicated error with mitigation guidance. In practice, the most useful actions are:

  • upgrade to a newer HDK/CANN stack if one is available;

  • reduce cudagraph_capture_sizes or max_cudagraph_capture_size;

  • prefer FULL or FULL_DECODE_ONLY when the workload is mostly uniform decode;

  • temporarily disable graph mode to confirm the issue is capture-related.

This is most likely to appear in PIECEWISE or FULL_AND_PIECEWISE configurations because those paths tend to capture more graphs than uniform full-graph decode.

Using Npugraph_ex#

As introduced in the RFC, Npugraph_ex is a compile-time FX graph optimization layer that works together with ACLGraph. It optimizes the model’s FX graph before ACLGraph captures it at runtime. Its performance benefits mainly come from fusing multiple operators into single kernels (e.g., add + rms_norm → npu_add_rms_norm) to reduce kernel launch overhead.

Default behavior#

Npugraph_ex is enabled by default when cudagraph_mode is FULL or FULL_DECODE_ONLY. It is automatically disabled in PIECEWISE or NONE modes.

This means for most users, Npugraph_ex is active without any explicit configuration:

from vllm import LLM

# Npugraph_ex is enabled by default in FULL/FULL_DECODE_ONLY mode
llm = LLM(model="path/to/Qwen2-7B-Instruct")
outputs = llm.generate("Hello, how are you?")

Explicit configuration#

To explicitly control Npugraph_ex:

Offline example:

from vllm import LLM

model = LLM(
    model="path/to/Qwen2-7B-Instruct",
    additional_config={
        "ascend_compilation_config": {
            "enable_npugraph_ex": True,
        }
    }
)
outputs = model.generate("Hello, how are you?")

Online example:

vllm serve Qwen/Qwen2-7B-Instruct \
  --additional-config '{"ascend_compilation_config":{"enable_npugraph_ex":true}}'

To disable Npugraph_ex explicitly:

vllm serve Qwen/Qwen2-7B-Instruct \
  --additional-config '{"ascend_compilation_config":{"enable_npugraph_ex":false}}'

Static kernel compilation#

Static kernel compilation is an optional feature that pre-compiles operator binaries with fixed shapes at compile time, reducing runtime overhead for networks with static or near-static shapes. It is disabled by default and must be explicitly enabled.

Note

Enabling static kernel triggers a compilation pass during the graph capture phase at service startup. This may add several minutes to tens of minutes to the startup time depending on the number of operators to compile and model complexity. Once completed, subsequent request processing is not affected.

Offline example:

from vllm import LLM

model = LLM(
    model="path/to/Qwen2-7B-Instruct",
    additional_config={
        "ascend_compilation_config": {
            "enable_npugraph_ex": True,
            "enable_static_kernel": True,
        }
    }
)
outputs = model.generate("Hello, how are you?")

Online example:

vllm serve Qwen/Qwen2-7B-Instruct \
  --additional-config '{"ascend_compilation_config":{"enable_npugraph_ex":true, "enable_static_kernel":true}}'

Verifying static kernel is active#

The recommended way to verify static kernel is in effect is through Ascend Profiling:

  1. Collect a profiling trace of your running model using Ascend PyTorch Profiler (torch_npu.profiler).

  2. Open the generated op_statistic.csv file.

  3. Look for operators whose op_type or name column contains the keyword static_kernel. If such entries exist, static kernel compilation has taken effect for those operators.

During the compilation phase, you will see a Python warning (visible by default):

Starting static kernel compilation, the build directory is <path>

This confirms that compilation has been triggered. The absence of this message means static kernel was not enabled or the cached result was reused directly.

For more details about Npugraph_ex, see the npugraph_ex guide.

Using XliteGraph#

XliteGraph is an optional path for Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-VL. It requires Xlite to be installed and configured through xlite_graph_config.

Install Xlite first:

pip install xlite

Offline example:

from vllm import LLM

# Xlite supports decode-only mode by default.
# Full mode can be enabled with "full_mode": True.
llm = LLM(
    model="path/to/Qwen3-32B",
    tensor_parallel_size=8,
    additional_config={
        "xlite_graph_config": {
            "enabled": True,
            "full_mode": True,
        }
    },
)
outputs = llm.generate("Hello, how are you?")

Online example:

vllm serve path/to/Qwen3-32B \
  --tensor-parallel-size 8 \
  --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

For more details about Xlite, see the Xlite README.

Common Limitations and Caveats#

  • XliteGraph should be treated as an alternative graph path, not as a drop-in replacement for ACLGraph in all scenarios.

  • Model and backend coverage is still evolving, so a configuration that works for one model family may not yet be recommended for another.

  • Encoder-decoder models currently do not keep FULL_AND_PIECEWISE; on Ascend they fall back to PIECEWISE or NONE depending on compilation support.

Fallback to Eager Mode#

If you encounter issues with graph mode, you can temporarily fall back to eager mode by setting enforce_eager=True.

If ACL graph capture fails with the confirmed stream-resource signature in the error text, such as 207008 together with Stream resources are insufficient or Insufficient_Stream_Resources, vLLM Ascend will re-raise that capture failure with targeted mitigation guidance. In practice, the main levers are: upgrading to a newer HDK/CANN stack, reducing cudagraph_capture_sizes, lowering max_cudagraph_capture_size, or preferring FULL / FULL_DECODE_ONLY when the workload is mostly uniform decode.

Offline example:

from vllm import LLM

llm = LLM(model="path/to/your/model", enforce_eager=True)
outputs = llm.generate("Hello, how are you?")

Online example:

vllm serve path/to/your/model --enforce-eager

References#