FAQs#

Version Specific FAQs#

General FAQs#

1. What devices are currently supported?#

Currently, ONLY the Atlas A2 series (Ascend-cann-kernels-910b) and the Atlas A3 Training series are supported:

  • Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)

  • Atlas A3 Training series

  • Atlas 800I A2 Inference series (Atlas 800I A2)

The following series are NOT supported yet:

  • Atlas 300I Duo, Atlas 300I Pro (Ascend-cann-kernels-310p): might be supported in 2025 Q2

  • Atlas 200I A2 (Ascend-cann-kernels-310b): not planned yet

  • Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910): not planned yet

From a technical point of view, vllm-ascend can support a device if torch-npu supports it. Otherwise, support has to be implemented with custom ops. You are welcome to join us and improve it together.

2. How to get our docker containers?#

You can get our containers at Quay.io, e.g., vllm-ascend and cann.

If you are in China, you can use daocloud to speed up your downloads:

# Replace with tag you want to pull
TAG=v0.9.1
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG

3. What models does vllm-ascend support?#

Find more details here.

4. How to get in touch with our community?#

There are several channels through which you can reach our community developers and users:

  • Submit a GitHub issue.

  • Join our weekly meeting and share your ideas.

  • Join our WeChat group and ask your questions.

  • Join our ascend channel in vLLM forums and publish your topics.

5. What features does vllm-ascend V1 support?#

Find more details here.

6. How to get better performance in Non-MLA LLMs?#

For non-MLA LLMs, vllm-ascend forcibly disables chunked prefill, because the operators that support this feature currently perform poorly. In this scenario, the Ascend scheduler is enforced and chunked prefill is turned off. Note that when you launch a non-MLA model with a simple script, the behavior therefore deviates from vLLM’s default of enabling chunked prefill: prefill and decode are scheduled separately. Consequently, inference performance may be significantly lower than expected.

Accordingly, we recommend the following serving configuration to achieve optimal performance on a single node:

  1. We recommend setting --max-model-len to a value just slightly larger than max_input_len + max_output_len; this reserves more KV-cache allocation headroom and reduces the risk of OOM.

  2. We recommend aligning --max-num-batched-tokens with --max-model-len, or setting it a few times larger than the average input length in your dataset; this helps maintain a good load balance between the prefill and decode phases.
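The two recommendations above can be turned into concrete flag values with a small helper. This is only a minimal sketch and not part of vllm-ascend: the function name, the 64-token margin, and the example lengths are assumptions chosen for illustration; only the --max-model-len and --max-num-batched-tokens flags come from the text.

```python
def suggest_serving_flags(max_input_len: int, max_output_len: int, margin: int = 64) -> dict:
    """Return flag values that follow the two recommendations above.

    `margin` is a hypothetical safety margin; tune it for your workload.
    """
    if max_input_len <= 0 or max_output_len <= 0 or margin < 0:
        raise ValueError("lengths must be positive and margin must be non-negative")
    # Recommendation 1: just slightly larger than input length + output length.
    max_model_len = max_input_len + max_output_len + margin
    # Recommendation 2: align the batched-token budget with --max-model-len.
    return {
        "--max-model-len": max_model_len,
        "--max-num-batched-tokens": max_model_len,
    }


# Illustrative lengths only; replace them with the values of your own dataset.
flags = suggest_serving_flags(max_input_len=3072, max_output_len=1024)
print(" ".join(f"{name} {value}" for name, value in flags.items()))
```

The printed string can be appended to your vllm serve command.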

7. How to solve the problem of “Failed to infer device type” or “libatb.so: cannot open shared object file”?#

Basically, the reason is that the NPU environment is not configured correctly. You can:

  1. Run source /usr/local/Ascend/nnal/atb/set_env.sh to enable the NNAL package.

  2. Run source /usr/local/Ascend/ascend-toolkit/set_env.sh to enable the CANN package.

  3. Run npu-smi info to check whether the NPU is working.

If none of the above steps help, run the following Python code to check whether any import raises an error:

import torch
import torch_npu
import vllm
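If you prefer a report that shows which import fails instead of stopping at the first error, the check above could be wrapped as follows. This is a sketch under the assumption that the three modules listed above are the relevant ones; the helper name check_imports is hypothetical.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and return {name: error message or None}."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # e.g. ImportError, or OSError from a missing .so file
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Same modules, same order as in the snippet above.
    for name, error in check_imports(["torch", "torch_npu", "vllm"]).items():
        print(f"{name}: {'OK' if error is None else error}")
```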

If the problem persists, feel free to submit a GitHub issue.

8. How does vllm-ascend perform?#

Currently, performance has been optimized only for some models, such as Qwen2.5 VL, Qwen3, and DeepSeek V3. Other models are not yet well optimized. Starting from 0.9.0rc2, Qwen and DeepSeek models work with graph mode to deliver good performance.

9. How does vllm-ascend work with vllm?#

vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure that vllm-ascend and vllm are compatible at each commit.
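The version rule can be checked mechanically. The sketch below compares only the X.Y.Z part of two version strings; ignoring suffixes such as rc1 is an assumption made for illustration, and both helper names are hypothetical.

```python
import re


def base_version(version: str) -> str:
    """Extract the leading X.Y.Z part of a version string."""
    match = re.match(r"\d+\.\d+\.\d+", version)
    if match is None:
        raise ValueError(f"not an X.Y.Z style version: {version!r}")
    return match.group(0)


def versions_compatible(vllm_version: str, vllm_ascend_version: str) -> bool:
    """Assumption: the two packages match when their X.Y.Z parts are equal."""
    return base_version(vllm_version) == base_version(vllm_ascend_version)


print(versions_compatible("0.9.1", "0.9.1"))  # the example from the text
```

The installed version strings can be read with importlib.metadata.version from the Python standard library.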

10. Does vllm-ascend support Prefill Disaggregation feature?#

Yes, the Prefill Disaggregation feature is supported on the V1 Engine, including NPND deployments.

11. Does vllm-ascend support quantization methods?#

w8a8 and w4a8 quantization are natively supported by vllm-ascend in v0.8.4rc2 and later.

12. How to run w8a8 DeepSeek model?#

Please follow the inference tutorial and replace the model with DeepSeek.

13. How is vllm-ascend tested?#

vllm-ascend is tested through functional, performance, and accuracy tests.

  • Functional test: we added CI that includes a portion of vllm’s native unit tests and vllm-ascend’s own unit tests. In vllm-ascend’s tests, we cover basic functionality, the availability of popular models, and supported features via e2e tests.

  • Performance test: we provide benchmark tools for end-to-end performance benchmarking that can easily be re-run locally. We will publish a perf website that shows the performance test results for each pull request.

  • Accuracy test: we’re working on adding accuracy tests to CI as well.

Finally, for each release, we will publish performance and accuracy test reports in the future.

14. How to fix the error “InvalidVersion” when using vllm-ascend?#

This usually happens because you have installed a dev/editable version of the vLLM package. In this case, we provide the environment variable VLLM_VERSION so that you can specify the version of the vLLM package to use. Set VLLM_VERSION to the version of the vLLM package you have installed. The format of VLLM_VERSION should be X.Y.Z.
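As an illustration, the variable can also be set from Python. This sketch assumes that VLLM_VERSION has to be set before vllm or vllm-ascend is imported; the helper name is hypothetical, and 0.9.1 is only an example value.

```python
import os
import re


def set_vllm_version(version: str) -> None:
    """Validate the X.Y.Z format and export VLLM_VERSION for this process."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"VLLM_VERSION must look like X.Y.Z, got {version!r}")
    os.environ["VLLM_VERSION"] = version


# Example value only; use the version of vLLM that you actually installed.
set_vllm_version("0.9.1")
```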

15. How to handle Out Of Memory?#

OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to vLLM’s OOM troubleshooting documentation.

In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:

  • Adjust --gpu-memory-utilization: If unspecified, the default value of 0.9 is used. You can decrease this value to reserve more memory and reduce fragmentation risks. For more details, see vLLM - Inference and Serving - Engine Arguments.

  • Configure PYTORCH_NPU_ALLOC_CONF: Set this environment variable to optimize NPU memory management. For example, export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True enables the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments at runtime. For more details, see PYTORCH_NPU_ALLOC_CONF.
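For offline inference, both settings could be applied from Python as sketched below. The helper name and the value 0.8 are illustrative assumptions, and the sketch assumes the environment variable must be set before torch_npu initializes its allocator. The vllm import is kept inside a function that is not called here, so the snippet also runs on machines without an NPU.

```python
import os


def configure_memory(gpu_memory_utilization: float = 0.8) -> dict:
    """Set the allocator option and return engine kwargs for vllm.LLM."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    # Enable the virtual memory feature described above.
    os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"
    # Below the 0.9 default, to reserve more memory.
    return {"gpu_memory_utilization": gpu_memory_utilization}


def build_llm(model: str, engine_kwargs: dict):
    """Create the engine; requires vllm and vllm-ascend to be installed."""
    from vllm import LLM
    return LLM(model=model, **engine_kwargs)


engine_kwargs = configure_memory(0.8)
# llm = build_llm("Qwen/Qwen2.5-0.5B-Instruct", engine_kwargs)
```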

16. Failed to enable NPU graph mode when running DeepSeek?#

You may encounter the error shown below when running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the number of queries per KV head must be one of {32, 64, 128}. DeepSeek-V2-Lite is therefore not supported, as it has only 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.

If you’re using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, num_heads / num_kv_heads is in {32, 64, 128}.

[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
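The constraint can be checked before launching. This is a rough sketch, not the check that the operator itself performs: it assumes attention heads are split evenly across tensor parallel ranks and that MLA exposes a single KV head per rank (num_kv_heads=1). The function names and the 128-head example are hypothetical.

```python
ALLOWED_QUERIES_PER_KV = {32, 64, 128}


def queries_per_kv_after_tp(num_heads: int, tp_size: int, num_kv_heads: int = 1) -> int:
    """Query heads per KV head on one rank after an even tensor parallel split."""
    if tp_size <= 0 or num_heads % tp_size != 0:
        raise ValueError("num_heads must be divisible by a positive tp_size")
    heads_per_rank = num_heads // tp_size
    # Assumption: KV heads are split as well, but never below one per rank.
    kv_heads_per_rank = max(1, num_kv_heads // tp_size)
    return heads_per_rank // kv_heads_per_rank


def mla_graph_mode_supported(num_heads: int, tp_size: int, num_kv_heads: int = 1) -> bool:
    """True if the ratio falls in the set quoted in the error message."""
    return queries_per_kv_after_tp(num_heads, tp_size, num_kv_heads) in ALLOWED_QUERIES_PER_KV


print(mla_graph_mode_supported(16, 1))    # 16 heads, as for DeepSeek-V2-Lite
print(mla_graph_mode_supported(128, 4))   # hypothetical 128-head model, TP=4
print(mla_graph_mode_supported(128, 16))  # hypothetical 128-head model, TP=16
```

Under these assumptions, a ratio of 8 (as in the log above) corresponds to the last call.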

17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?#

You may encounter a C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, we recommend installing with python setup.py install, or running python setup.py clean to clear the cache.

18. How to generate deterministic results when using vllm-ascend?#

Several factors affect output determinism:

  1. Sampling method: use greedy sampling by setting temperature=0 in SamplingParams, e.g.:

import os
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

  2. Set the following environment variables:

export LCCL_DETERMINISTIC=1
export HCCL_DETERMINISTIC=true
export ATB_MATMUL_SHUFFLE_K_ENABLE=0
export ATB_LLM_LCOC_ENABLE=0

19. How to fix the error “ImportError: Please install vllm[audio] for audio support” for Qwen2.5-Omni model?#

The Qwen2.5-Omni model requires the librosa package. Install the qwen-omni-utils package with pip install qwen-omni-utils to make sure all dependencies are met. This package installs librosa and its related dependencies, which resolves the ImportError: No module named 'librosa' issue and ensures that audio processing works correctly.

20. Failed to run with ray distributed backend?#

You might encounter the following errors when running with the ray backend in distributed scenarios:

TypeError: can't convert npu:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
AttributeError: 'str' object has no attribute 'DESCRIPTOR' when packaging message to dict

This has been fixed in ray>=2.47.1, so you can resolve it by upgrading as follows:

python3 -m pip install modelscope 'ray>=2.47.1' 'protobuf>3.20.0'

21. Failed to run inference with Qwen3 MoE due to the Alloc sq cq fail issue?#

When running Qwen3 MoE with tp/dp/ep, etc., you may encounter an error shown in #2629.

This is more likely to happen on A3. Use the empirical formula below to estimate a suitable value for the cuda-capture-sizes argument:

# pg_num: the number of process groups for communication
pg_num = sum(size > 1 for size in [
    parallel_config.data_parallel_size,
    parallel_config.tensor_parallel_size,
])
# num_hidden_layer: number of hidden layers of the model

# for A2:
num_capture_sizes = (1920) / (num_hidden_layer + 1) / (1 + pg_num * 1)
# for A3:
num_capture_sizes = (1920 - pg_num * 40) / (num_hidden_layer + 1) / (1 + pg_num * 2)

Find more details about how to calculate this value at #2629.

Try to adjust the arg cuda-capture-sizes to address this:

vllm serve ... \
--cuda-capture-sizes=num_capture_sizes
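The empirical formula above can be wrapped in a runnable helper. This is a sketch under stated assumptions: the function name is hypothetical, the result is rounded down to an integer (the formula itself does not specify rounding), and the example model with 48 hidden layers is illustrative only.

```python
import math


def estimate_num_capture_sizes(
    num_hidden_layers: int,
    data_parallel_size: int,
    tensor_parallel_size: int,
    device: str = "A2",
) -> int:
    """Apply the empirical formula above and round the result down."""
    # pg_num: the number of process groups used for communication.
    pg_num = sum(size > 1 for size in (data_parallel_size, tensor_parallel_size))
    if device == "A2":
        value = 1920 / (num_hidden_layers + 1) / (1 + pg_num * 1)
    elif device == "A3":
        value = (1920 - pg_num * 40) / (num_hidden_layers + 1) / (1 + pg_num * 2)
    else:
        raise ValueError(f"unknown device series: {device!r}")
    return math.floor(value)


# Hypothetical model with 48 hidden layers, DP=2 and TP=4 (so pg_num == 2).
print(estimate_num_capture_sizes(48, 2, 4, device="A2"))  # 13
print(estimate_num_capture_sizes(48, 2, 4, device="A3"))  # 7
```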