FAQs

Version Specific FAQs

[v0.11.0] FAQ & Feedback

General FAQs

1. What devices are currently supported?

Currently, only the Atlas A2 series (Ascend-cann-kernels-910b), the Atlas A3 series (Atlas-A3-cann-kernels), and the Atlas 300I series (Ascend-cann-kernels-310p) are supported:

Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
Atlas 800I A2 inference series (Atlas 800I A2)
Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
Atlas 800I A3 inference series (Atlas 800I A3)
[Experimental] Atlas 300I inference series (Atlas 300I Duo). For the 300I Duo, the stable version is currently vllm-ascend v0.10.0rc1.

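The following series are not supported yet:

Atlas 200I A2 (Ascend-cann-kernels-310b): support is not planned yet
Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910): support is not planned yet

From a technical point of view, vllm-ascend can potentially support a device if torch-npu supports it. Otherwise, support has to be implemented with custom operators. You are welcome to join us and help improve it.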

2. How to get our Docker containers?

You can get our containers at Quay.io, for example vllm-ascend and cann.

If you are in China, you can use daocloud to accelerate the download:

# Replace with tag you want to pull
TAG=v0.7.3rc2
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG

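Load Docker images for an offline environment

If you want to use a container image in an offline environment (no internet connection), first download the image in an environment that has internet access:

Exporting Docker images: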

# Pull the image on a machine with internet access
TAG=v0.11.0
docker pull quay.io/ascend/vllm-ascend:$TAG

# Export the image to a tar file and compress to tar.gz
docker save quay.io/ascend/vllm-ascend:$TAG | gzip > vllm-ascend-$TAG.tar.gz

Importing Docker images in the environment without internet access:

# Transfer the tar/tar.gz file to the offline environment and load it
TAG=v0.11.0
docker load -i vllm-ascend-$TAG.tar.gz

# Verify the image is loaded
docker images | grep vllm-ascend

3. What models does vllm-ascend support?

Find more details here.

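4. How to get in touch with our community?

There are several channels for talking with our community developers and users:

Submit a GitHub issue.
Join our weekly meeting and share your ideas.
Join our WeChat group and ask your questions.
Join the Ascend channel in the vLLM forums and post your topics.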

5. What features does vllm-ascend V1 support?

Find more details here.

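6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?

The root cause is that the NPU environment is not configured correctly. You can:

Try source /usr/local/Ascend/nnal/atb/set_env.sh to enable the NNAL package.
Try source /usr/local/Ascend/ascend-toolkit/set_env.sh to enable the CANN package.
Try npu-smi info to check whether the NPU is working.

If none of the steps above help, run the following Python code to check whether any import raises an error: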

import torch
import torch_npu
import vllm

If the problem persists, feel free to submit a GitHub issue.
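The checks above can be wrapped in a small script. The following is a minimal sketch, not part of vllm-ascend: the helper name find_missing is hypothetical, and the two paths are the default install locations quoted above, which may differ on your machine. It only reports which files and modules cannot be found; it does not source the scripts or import the packages.

```python
import importlib.util
import os

# Default locations quoted in this FAQ; adjust if CANN/NNAL live elsewhere.
ENV_SCRIPTS = [
    "/usr/local/Ascend/ascend-toolkit/set_env.sh",
    "/usr/local/Ascend/nnal/atb/set_env.sh",
]
MODULES = ["torch", "torch_npu", "vllm"]


def find_missing(paths, modules):
    """Return (missing_paths, missing_modules) without importing anything."""
    missing_paths = [p for p in paths if not os.path.isfile(p)]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_paths, missing_modules


if __name__ == "__main__":
    paths, modules = find_missing(ENV_SCRIPTS, MODULES)
    for p in paths:
        print(f"env script not found: {p}")
    for m in modules:
        print(f"module not installed: {m}")
```

A module that is found here can still fail at import time (for example when libatb.so is missing), so the plain import test above remains the decisive check.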

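7. How does vllm-ascend perform?

Currently, performance has been improved for some models, such as Qwen2.5 VL, Qwen3, and DeepSeek V3. Starting with version 0.9.0rc2, Qwen and DeepSeek run in graph mode to deliver good performance. In addition, you can install mindie-turbo with vllm-ascend v0.7.3 to speed up inference.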

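8. How does vllm-ascend work with vllm?

vllm-ascend is a plugin for vllm. As a rule, the vllm-ascend version matches the vllm version. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. On the main branch, we make sure that vllm-ascend and vllm are compatible at every commit.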

9. Does vllm-ascend support the prefill-decode disaggregation feature?

Currently, only 1P1D is supported, and only on the V0 engine. Support for the V1 engine and for NPND is planned and will be stabilized in vllm-ascend in the future.

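10. Does vllm-ascend support quantization methods?

Currently, W8A8 quantization is natively supported in vllm-ascend v0.8.4rc2 and later. If you are using vllm 0.7.3, W8A8 quantization is also supported through the integration of vllm-ascend and mindie-turbo; install it with pip install vllm-ascend[mindie-turbo].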

11. How to run a W8A8 DeepSeek model?

Follow the inference tutorial and replace the model with DeepSeek.

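13. How is vllm-ascend tested?

vllm-ascend is tested in three respects: functionality, performance, and accuracy.

Functional test: we added CI that includes part of vllm's native unit tests and vllm-ascend's own unit tests. In the vllm-ascend tests, E2E tests cover basic functionality, the availability of popular models, and the supported features.
Performance test: we provide benchmark tools for E2E performance benchmarking that can easily be re-run locally. We will publish a performance website that shows the performance test results for each pull request.
Accuracy test: we are also working on adding accuracy tests to the CI.

Finally, for each release, we will publish performance and accuracy test reports in the future.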

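14. How to fix the error "InvalidVersion" when using vllm-ascend?

This problem is usually caused by installing a development or editable version of the vLLM package. For this case, we provide the environment variable VLLM_VERSION, which lets you specify the version of the vLLM package to use. Set VLLM_VERSION to the version of the vLLM package you have installed. The format of VLLM_VERSION is X.Y.Z.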
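As an illustration, the sketch below derives an X.Y.Z string from a development-style version string and exports it. The helper to_xyz and the sample version string are hypothetical; only the variable name VLLM_VERSION and its X.Y.Z format come from this FAQ. The sketch also assumes the variable has to be set before vllm-ascend is loaded.

```python
import os
import re


def to_xyz(version: str) -> str:
    """Extract the leading X.Y.Z part of a version string."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no X.Y.Z prefix in {version!r}")
    return ".".join(match.groups())


# Hypothetical version string of a source build.
installed = "0.11.0rc2.dev5+gabcdef0"
os.environ["VLLM_VERSION"] = to_xyz(installed)
print(os.environ["VLLM_VERSION"])
```

In a shell, the equivalent is a plain export of VLLM_VERSION with the same X.Y.Z value before starting vLLM.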

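15. How to handle the out-of-memory issue?

OOM (out-of-memory) errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, refer to the vLLM OOM troubleshooting documentation.

When the NPU's high-bandwidth memory (HBM) capacity is limited, dynamic memory allocation and release during inference can worsen memory fragmentation and lead to OOM. To address this:

Adjust --gpu-memory-utilization: if unspecified, the default value is 0.9. You can lower this value to reserve more memory and reduce the risk of fragmentation. See: vLLM - Inferencing and Serving - Engine Arguments.
Configure PYTORCH_NPU_ALLOC_CONF: set this environment variable to optimize NPU memory management. For example, export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True enables the virtual memory feature, which mitigates the fragmentation caused by frequent dynamic memory size adjustments at runtime. See: PYTORCH_NPU_ALLOC_CONF.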
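The two settings can be combined in one launch. The sketch below only assembles the environment and the command line; it does not start a server. The helper build_launch is hypothetical, the value 0.8 is an arbitrary illustrative choice rather than a recommendation, and the model name is the one used elsewhere in this FAQ.

```python
import os
import shlex


def build_launch(model: str, memory_utilization: float = 0.8):
    """Return (env, argv) for a server launch with OOM mitigations applied."""
    if not 0.0 < memory_utilization <= 1.0:
        raise ValueError("memory_utilization must be in (0, 1]")
    env = dict(os.environ)
    # Enable the virtual-memory feature to reduce fragmentation.
    env["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"
    argv = [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(memory_utilization),
    ]
    return env, argv


env, argv = build_launch("Qwen/Qwen2.5-0.5B-Instruct")
print(shlex.join(argv))
# To actually start the server: subprocess.run(argv, env=env, check=True)
```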

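16. Failed to enable NPU graph mode when running DeepSeek.

If you run DeepSeek with NPU graph mode enabled, you may encounter the error shown below. When MLA and graph mode are both enabled, the allowed number of queries per KV head is {32, 64, 128}. DeepSeek-V2-Lite is therefore not supported, because it has only 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.

If you are using DeepSeek-V3 or DeepSeek-R1, make sure that after the tensor-parallel split, num_heads/num_kv_heads is one of {32, 64, 128}.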

[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
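
The constraint can be checked before launch. This is a minimal sketch under two assumptions that are not stated in this FAQ: query heads are split evenly across tensor-parallel ranks, and the per-rank KV head count for MLA is 1. The function mla_graph_ratio_ok is hypothetical, and the head counts in the usage lines are illustrative inputs rather than values read from a model config.

```python
ALLOWED_RATIOS = {32, 64, 128}


def mla_graph_ratio_ok(num_heads: int, tp_size: int, num_kv_heads: int = 1) -> bool:
    """Check num_heads / num_kv_heads on one rank against the allowed set.

    Assumes query heads are divided evenly by tp_size and that num_kv_heads
    is the per-rank KV head count (1 is an assumption for MLA).
    """
    if num_heads % tp_size != 0:
        return False
    heads_per_rank = num_heads // tp_size
    if heads_per_rank % num_kv_heads != 0:
        return False
    return heads_per_rank // num_kv_heads in ALLOWED_RATIOS


print(mla_graph_ratio_ok(num_heads=16, tp_size=1))    # 16 heads, as for DeepSeek-V2-Lite
print(mla_graph_ratio_ok(num_heads=128, tp_size=4))   # illustrative: 128 heads on 4 ranks
print(mla_graph_ratio_ok(num_heads=128, tp_size=16))  # illustrative: ratio 8, as in the log
```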

17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend.

When you reinstall vllm-ascend from source with pip, the C compilation step may fail. If the installation fails, install with python setup.py install (recommended), or clear the cache with python setup.py clean.

18. How to generate deterministic results when using vllm-ascend?

Several factors affect output determinism:

Sampling method: use greedy sampling by setting temperature=0 in SamplingParams, for example:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Set the following environment variables:

export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=true
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
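
When vLLM is driven from a Python script rather than a shell, the same variables can be set in-process. This sketch assumes the variables are read when the engine starts, so it sets them before vllm is imported; the helper name enable_deterministic_env is hypothetical.

```python
import os

DETERMINISTIC_ENV = {
    "LCCL_DETERMINISTIC": "1",
    "HCCL_DETERMINISTIC": "true",
    "ATB_MATMUL_SHUFFLE_K_ENABLE": "0",
    "ATB_LLM_LCOC_ENABLE": "0",
}


def enable_deterministic_env(environ=os.environ):
    """Write the determinism settings listed above into environ."""
    environ.update(DETERMINISTIC_ENV)
    return environ


enable_deterministic_env()
# Import vllm only after this point, e.g.:
# from vllm import LLM, SamplingParams
```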

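19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model?

The Qwen2.5-Omni model requires the librosa package. Install the qwen-omni-utils package to make sure all dependencies are met: pip install qwen-omni-utils. This package installs librosa and its related dependencies, which resolves the ImportError: No module named 'librosa' issue and ensures that audio processing works correctly.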

20. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?

Error example in detail:
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph sizes capture fail: RuntimeError:
ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient available streams to capture the configured number of sizes.Please verify both the availability of adequate streams and the appropriateness of the configured size count.

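Recommended mitigation strategies:

Manually configure the compilation_config parameter with a reduced set of sizes: '{"cudagraph_capture_sizes":[size1, size2, size3, ...]}'.
Use the full graph mode of ACLGraph as an alternative to the piecewise approach.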
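
The first mitigation can be sketched as follows. The helper reduced_capture_config is hypothetical, and the candidate sizes and the limit of four are arbitrary illustrative values, not recommendations; choose sizes that match the batch sizes you actually serve.

```python
import json


def reduced_capture_config(sizes, max_sizes):
    """Keep at most max_sizes evenly spaced entries, always including the largest."""
    ordered = sorted(set(sizes))
    if max_sizes < 1:
        raise ValueError("max_sizes must be at least 1")
    if len(ordered) > max_sizes:
        if max_sizes == 1:
            ordered = [ordered[-1]]
        else:
            step = (len(ordered) - 1) / (max_sizes - 1)
            ordered = [ordered[round(i * step)] for i in range(max_sizes)]
    return json.dumps({"cudagraph_capture_sizes": ordered})


candidates = [1, 2, 4, 8, 16, 24, 32, 48, 64]
print(reduced_capture_config(candidates, max_sizes=4))
```

The printed JSON string is the value to supply as compilation_config, for example through the --compilation-config option of vllm serve.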

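Root cause analysis: the current calculation of stream requirements for size capture only accounts for measurable factors: data parallel size, tensor parallel size, expert parallel configuration, the number of piecewise graphs, the multi-stream overlap shared-expert setting, and the HCCL communication mode (AIV/AICPU). However, many factors that cannot be quantified, such as operator characteristics and specific hardware features, consume additional streams outside this calculation, so stream resources run out during size capture.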

21. Installing vllm-ascend will overwrite the existing torch-npu package.

Installing vllm-ascend overwrites the existing torch-npu package. If you need a specific version of torch-npu, install that version manually after installing vllm-ascend.