FAQs#
Version Specific FAQs#
General FAQs#
1. What devices are currently supported?#
Currently, ONLY the Atlas A2 series (Ascend-cann-kernels-910b) is supported:
Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
Atlas A3 training series
Atlas 800I A2 inference series (Atlas 800I A2)
The following series are NOT supported yet:
Atlas 300I Duo, Atlas 300I Pro (Ascend-cann-kernels-310p): may be supported in 2025 Q2
Atlas 200I A2 (Ascend-cann-kernels-310b): not planned yet
Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910): not planned yet
From a technical point of view, vllm-ascend can be supported wherever torch-npu is supported. Otherwise, we have to implement it with custom operators. You are also welcome to join us and improve it together.
2. How to get our docker containers?#
You can get our containers from Quay.io, for example, vllm-ascend and cann.
If you are in China, you can use daocloud to accelerate the download:
# Replace with tag you want to pull
TAG=v0.9.1
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
3. What models does vllm-ascend support?#
Find more details here.
4. How to get in touch with our community?#
You can reach our community developers and users through several channels:
5. What features does vllm-ascend V1 support?#
Find more details here.
6. How to get better performance in Non-MLA LLMs?#
For non-MLA LLMs, vllm-ascend forcibly disables chunked prefill, because the operators that support this feature currently perform poorly. In this scenario we enforce the Ascend scheduler with chunked prefill turned off. Note that when you launch a non-MLA model with a simple script, the behavior therefore deviates from vLLM's default of enabling chunked prefill: prefill and decode are scheduled separately. As a result, inference performance may be significantly lower than expected.
Accordingly, we recommend the following serving configuration to achieve optimal performance on a single node:
We recommend setting --max-model-len to a value just slightly larger than max_input_len + max_output_len; this reserves more KV-cache allocation headroom and reduces the risk of OOM.
We recommend aligning --max-num-batched-tokens with --max-model-len, or setting it a few times larger than the average input length in your dataset; this helps maintain a good load balance between the prefill and decode phases.
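The two recommendations above can be turned into a small calculation. The following is a minimal sketch, not part of vllm-ascend: the helper name `suggest_serving_args`, the `margin` of 64 tokens, and the `multiple` of 4 are illustrative assumptions that you should tune for your own workload.

```python
def suggest_serving_args(max_input_len, max_output_len,
                         avg_input_len=None, margin=64, multiple=4):
    """Derive serving flags from dataset statistics (hypothetical helper)."""
    # "Just slightly larger" than the longest request: add a small margin.
    max_model_len = max_input_len + max_output_len + margin
    if avg_input_len is None:
        # Option 1: align the batch budget with --max-model-len.
        max_num_batched_tokens = max_model_len
    else:
        # Option 2: a few times the average input length, but never below
        # max_model_len, so one full-length prompt still fits in a batch
        # when prefill is not chunked.
        max_num_batched_tokens = max(max_model_len, multiple * avg_input_len)
    return {
        "--max-model-len": max_model_len,
        "--max-num-batched-tokens": max_num_batched_tokens,
    }


args = suggest_serving_args(max_input_len=2048, max_output_len=512,
                            avg_input_len=1000)
print(" ".join(f"{flag}={value}" for flag, value in args.items()))
```

The printed flags can be appended to your `vllm serve` command. Benchmark a few values around the suggestion, since the best setting depends on the model and the request mix.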
8. How does vllm-ascend perform?#
Currently, performance is optimized only for some models, such as Qwen2.5 VL, Qwen3, and DeepSeek V3. Other models are not yet well optimized. Starting from 0.9.0rc2, Qwen and DeepSeek models work with graph mode to deliver good performance.
9. How does vllm-ascend work with vllm?#
vllm-ascend is a plugin for vllm. In general, the version of vllm-ascend matches the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure vllm-ascend and vllm stay compatible at each commit.
10. Does vllm-ascend support Prefill Disaggregation feature?#
Yes, Prefill Disaggregation is supported on the V1 engine for NPND scenarios.
11. Does vllm-ascend support quantization methods?#
w8a8 and w4a8 quantization are natively supported by vllm-ascend in v0.8.4rc2 and later.
12. How to run w8a8 DeepSeek model?#
Please follow the inference tutorial and replace the model with DeepSeek.
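13. How is vllm-ascend tested?#
vllm-ascend is covered by functional tests, performance tests, and accuracy tests.
Functional test: we added CI, which includes a portion of vllm's native unit tests and vllm-ascend's own unit tests. In vllm-ascend's tests, we cover basic functionality, the availability of popular models, and supported features via e2e tests.
Performance test: we provide benchmark tools for end-to-end performance benchmarking, which can easily be re-run locally. We will publish a performance website that shows the performance test results for each pull request.
Accuracy test: we are also working on adding accuracy tests to CI.
Finally, for each release, we will publish the performance test and accuracy test reports in the future.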
14. How to fix the error "InvalidVersion" when using vllm-ascend?#
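This usually happens because you installed a development or editable version of the vLLM package. For this case, we provide the environment variable VLLM_VERSION so that you can specify which vLLM version to use. Set VLLM_VERSION to the version of the vLLM package you have installed. The format of VLLM_VERSION should be X.Y.Z.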
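As a minimal sketch, the variable can also be set from Python with a format check. The helper name `set_vllm_version` is hypothetical, and setting the variable before importing vllm or vllm-ascend is an assumption we make to be safe; `0.9.1` is only an example value.

```python
import os
import re


def set_vllm_version(version: str) -> None:
    """Validate the X.Y.Z format and export VLLM_VERSION (hypothetical helper)."""
    if re.fullmatch(r"\d+\.\d+\.\d+", version) is None:
        raise ValueError(f"VLLM_VERSION must look like X.Y.Z, got {version!r}")
    os.environ["VLLM_VERSION"] = version


# Assumption: call this before importing vllm / vllm_ascend.
set_vllm_version("0.9.1")
print(os.environ["VLLM_VERSION"])
```

A development version string such as `0.9.1.dev0` is rejected by this check, which is the kind of value that leads to the "InvalidVersion" error in the first place.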
15. How to handle Out Of Memory?#
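OOM (out of memory) errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, refer to vLLM's OOM troubleshooting documentation.
When the NPU's HBM (high bandwidth memory) capacity is limited, dynamic memory allocation and deallocation during inference can worsen memory fragmentation and lead to OOM. To address this:
Adjust --gpu-memory-utilization: if unspecified, the default value of 0.9 is used. You can lower this value to reserve more memory and reduce the risk of fragmentation. See more details in: vLLM - Inference and Serving - Engine Arguments.
Configure PYTORCH_NPU_ALLOC_CONF: set this environment variable to optimize NPU memory management. For example, export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True enables the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments at runtime. See more details in: PYTORCH_NPU_ALLOC_CONF.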
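For offline inference, both settings can be applied from Python. This is a minimal sketch: the value 0.8 and the model name are illustrative choices, and setting the environment variable before any NPU memory is allocated is an assumption on our part.

```python
import os

# Assumption: set this before torch_npu allocates any device memory.
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

# gpu_memory_utilization mirrors the --gpu-memory-utilization flag.
# 0.8 is an illustrative value below the 0.9 default.
engine_kwargs = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "gpu_memory_utilization": 0.8,
}

# On a machine with vllm and vllm-ascend installed:
# from vllm import LLM
# llm = LLM(**engine_kwargs)
print(engine_kwargs)
```

Lower `gpu_memory_utilization` in small steps, because every reduction also shrinks the memory available to the KV cache.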
16. Failed to enable NPU graph mode when running DeepSeek?#
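You may encounter the following error when running DeepSeek with NPU graph mode enabled. When MLA and graph mode are both enabled, the number of queries per KV head only supports {32, 64, 128}. DeepSeek-V2-Lite is therefore not supported, as it has only 16 attention heads. Support for DeepSeek-V2-Lite in NPU graph mode will be added in the future.
If you are using DeepSeek-V3 or DeepSeek-R1, make sure that after the tensor parallel split, num_heads / num_kv_heads is one of {32, 64, 128}.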
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
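The constraint can be checked before launching. The sketch below is not vllm-ascend code: the function names are hypothetical, and it assumes that attention heads are split evenly across tensor parallel ranks and that each rank sees a single KV head under MLA. The 128-head model is an illustrative example.

```python
SUPPORTED_QUERIES_PER_KV = {32, 64, 128}


def queries_per_kv(num_attention_heads: int, tp_size: int,
                   num_kv_heads_per_rank: int = 1) -> int:
    """Return num_heads / num_kv_heads on one rank after the TP split."""
    if num_attention_heads % tp_size != 0:
        raise ValueError("num_attention_heads must be divisible by tp_size")
    heads_per_rank = num_attention_heads // tp_size
    return heads_per_rank // num_kv_heads_per_rank


def graph_mode_supported(num_attention_heads: int, tp_size: int) -> bool:
    """True if the per-rank ratio is one of the supported values."""
    return queries_per_kv(num_attention_heads, tp_size) in SUPPORTED_QUERIES_PER_KV


print(graph_mode_supported(16, 1))    # 16 heads, as in DeepSeek-V2-Lite
print(graph_mode_supported(128, 16))  # ratio 8, the value reported in the log
print(graph_mode_supported(128, 4))   # ratio 32
```

With these assumptions, a 128-head model satisfies the constraint for tensor parallel sizes 1, 2, and 4, and fails from 8 upward.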
17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?#
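When you reinstall vllm-ascend from source with pip, you may run into a C compilation failure. If the installation fails, we recommend installing with python setup.py install, or clearing the cache with python setup.py clean.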
18. How to generate deterministic results when using vllm-ascend?#
Several factors affect output determinism:
Sampling method: use greedy sampling by setting temperature=0 in SamplingParams, for example:
import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Set the following environment variables:
export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=true
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0
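For offline scripts, the same variables can be set from Python instead of the shell. A minimal sketch, assuming they need to be in place before vllm is imported (this ordering is an assumption, chosen to be safe):

```python
import os

# Same values as the shell exports above.
DETERMINISTIC_ENV = {
    "LCCL_DETERMINISTIC": "1",
    "HCCL_DETERMINISTIC": "true",
    "ATB_MATMUL_SHUFFLE_K_ENABLE": "0",
    "ATB_LLM_LCOC_ENABLE": "0",
}

# Assumption: apply before `from vllm import LLM, SamplingParams`.
os.environ.update(DETERMINISTIC_ENV)

for name in DETERMINISTIC_ENV:
    print(f"{name}={os.environ[name]}")
```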
19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?#
The Qwen2.5-Omni model requires the librosa package. Install the qwen-omni-utils package to make sure all dependencies are met:
pip install qwen-omni-utils
This package installs librosa and its related dependencies, which resolves the ImportError: No module named 'librosa' issue and ensures that audio processing works correctly.
20. Failed to run with ray distributed backend?#
You might face the following errors when running with the ray backend in distributed scenarios:
TypeError: can't convert npu:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
AttributeError: 'str' object has no attribute 'DESCRIPTOR' when packaging message to dict
This has been fixed in ray>=2.47.1, so you can resolve it by upgrading as follows:
python3 -m pip install modelscope 'ray>=2.47.1' 'protobuf>3.20.0'
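To check whether an environment already has a fixed ray version, a small sketch like the following can help. The helper names are hypothetical, and the version parsing is deliberately simple: it only reads the leading numeric components.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_RAY = (2, 47, 1)  # first version that contains the fix, per the text above


def parse_release(raw: str) -> tuple:
    """Extract (major, minor, patch) from a version string; missing patch -> 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", raw)
    if match is None:
        raise ValueError(f"unrecognized version: {raw!r}")
    return tuple(int(part or 0) for part in match.groups())


def ray_has_fix(ray_version: str) -> bool:
    return parse_release(ray_version) >= MIN_RAY


def installed_ray_has_fix() -> bool:
    """False if ray is missing or older than MIN_RAY."""
    try:
        return ray_has_fix(version("ray"))
    except PackageNotFoundError:
        return False


print(installed_ray_has_fix())
```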
21. Failed to run inference with Qwen3 MoE due to the Alloc sq cq fail issue?#
When running Qwen3 MoE with tp/dp/ep, etc., you may encounter the error shown in #2629.
This is more likely to happen on A3. Use the empirical formula below to estimate a suitable value for the cuda-capture-sizes argument:
# pg_num: the number of process groups for communication
pg_num = sum(size > 1 for size in [
    parallel_config.data_parallel_size,
    parallel_config.tensor_parallel_size,
])
# num_hidden_layer: number of hidden layers of the model
# for A2:
num_capture_sizes = (1920) / (num_hidden_layer + 1) / (1 + pg_num * 1)
# for A3:
num_capture_sizes = (1920 - pg_num * 40) / (num_hidden_layer + 1) / (1 + pg_num * 2)
Find more details about how to calculate this value at #2629.
Try to adjust the arg cuda-capture-sizes to address this:
vllm serve ... \
    --cuda-capture-sizes=num_capture_sizes
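The empirical formula above can be wrapped in a self-contained function. This is a sketch only: the function name is hypothetical, rounding down to an integer is our assumption, and the 48-layer model with dp=2 and tp=2 is an illustrative input.

```python
def estimate_num_capture_sizes(num_hidden_layers: int,
                               data_parallel_size: int,
                               tensor_parallel_size: int,
                               device: str = "A2") -> int:
    """Evaluate the empirical formula for A2 or A3 (hypothetical helper)."""
    # pg_num: number of process groups that actually communicate (size > 1).
    pg_num = sum(size > 1 for size in (data_parallel_size, tensor_parallel_size))
    if device == "A2":
        budget, per_group = 1920, 1
    elif device == "A3":
        budget, per_group = 1920 - pg_num * 40, 2
    else:
        raise ValueError(f"unknown device: {device!r}")
    value = budget / (num_hidden_layers + 1) / (1 + pg_num * per_group)
    # Assumption: round down so the estimate stays on the safe side.
    return int(value)


for device in ("A2", "A3"):
    print(device, estimate_num_capture_sizes(48, 2, 2, device))
```

For these inputs the estimate is 13 on A2 and 7 on A3, which illustrates why the limit is reached sooner on A3.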