Atlas 300I DUO#
Running vLLM on Atlas 300I DUO#
Notes#
The current release supports
FULL_DECODE_ONLYgraph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.
There is no such limitation when TP=1.
We have reached out to the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.
Atlas 300I DUO does not support
tritonortriton-ascend.If installing from source,
vllmandvllm-ascendwill automatically pull intritonandtriton-ascenddependencies, which may cause unexpected issues on Atlas 300I DUO. Please run:
pip uninstall -y triton triton-ascend
# If you still encounter errors mentioning triton, manually remove the remaining triton directory in site-packages,
# as uninstalling triton may leave residual files behind.
# For example: rm -rf /usr/local/python3.11.10/lib/python3.11/site-packages/triton
Deployment#
警告
For Atlas 300I DUO (310P), do not rely on automatic max-model-len detection
(that is, do not omit the --max-model-len argument), or OOM may occur.
原因(当前 310P 注意力路径):
AscendAttentionMetadataBuilder310将model_config.max_model_len传递给AttentionMaskBuilder310。AttentionMaskBuilder310builds a full float16 causal mask with shape[max_model_len, max_model_len], and then converts it to FRACTAL_NZ format.In the 310P
attention_v1prefill/chunked-prefill path (_npu_flash_attention/_npu_paged_attention_splitfuse), this explicit mask tensor is used directly, and there is currently no compressed-mask path.
If automatic parsing resolves to a large context length, allocating this mask
(O(max_model_len^2)) may exceed NPU memory and trigger OOM.
Be sure to set an explicit and conservative value, such as --max-model-len 16384.
Run the Docker container:
# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0-310p
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
Note#
The high performance is implemented based on the latest CANN community edition and the new PTA version. Therefore, you need to manually replace the CANN version with CANN 9.0.0 and torch_npu.The following uses Ubuntu as an example to describe how to install CANN. For details, see the following steps: Procedure Install the new PTA version. pip install torch_npu==2.9.0.post2 This step will be supported in v0.18.0.post and later versions. You can ignore it when it is supported.
Run the following steps to start the vLLM service on NPU for the Qwen3 Dense series:
Prepare the environment
Obtain model weights (
W8A8SCweights will be uploaded to the Eco-Tech official ModelScope repository later.)This guide requires
W8A8SCquantized weights for the Qwen3 Dense8B/14B/32Bmodels. You need to generate the SC-compressed weights yourself.First, prepare the
W8A8Sweights:Qwen3-8B-w8a8s-310: https://modelers.cn/models/Eco-Tech/Qwen3-8B-w8a8s-310
Qwen3-14B-w8a8s-310: https://modelers.cn/models/Eco-Tech/Qwen3-14B-w8a8s-310
Qwen3-32B-w8a8s-310: https://modelers.cn/models/Eco-Tech/Qwen3-32B-w8a8s-310
Note: if you want to validate directly with
w8a8sweights instead ofw8a8scweights, the following example shows the serving command forQwen3-8B-w8a8s-310. Performance is slightly lower than with compressedw8a8scweights. Detailedw8a8sctesting is covered in the following sections.vllm serve Eco-Tech/Qwen3-8B-w8a8s-310 --host 127.0.0.1 --port 8080 \ --tensor-parallel-size 1 --gpu_memory_utilization 0.90 \ --served_model_name qwen --dtype float16 \ --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \ --quantization ascend --max-model-len 16384 # `--load_format` is required only for the W8A8SC quantized weight format. #
Compress the weights
Uninstall triton (unsupported on 310P):
pip uninstall triton pip uninstall triton-ascend
Get the compression script:
Install the compression tool
Repository: https://gitcode.com/Ascend/msit.git
Installation guide: https://gitcode.com/Ascend/msit/blob/master/msmodelslim/docs/安装指南.md#基于atlas-300i-duo-系列产品安装
Compression command
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 export LD_LIBRARY_PATH=/usr/local/python3.11.10/lib/:$LD_LIBRARY_PATH python save_sharded_state_310.py \ --model /your-load-path/w8a8s-weight \ --tensor-parallel-size 1 \ --output /your-save-path/w8a8sc-weight \ --enable-compress \ --compress-process-num 4 \ --enforce-eager \ --dtype float16 \ --quantization ascend \ --max-model-len 10240
Argument notes:
--tensor-parallel-size:W8A8SCquantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression.--modelis the path to the inputw8a8sweights, and--outputis the output path for the compressedw8a8scweights.Additional notes
The Qwen3-8B model has fewer parameters, so some layers need fallback handling during quantization. It is recommended to download the
qwen3-8B-w8a8scweights directly from the Eco-Tech official ModelScope repository once available.
Examples
Qwen3-8B-w8a8sc example
vllm serve /your-save-path/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1/ \ --host 127.0.0.1 \ --port 8080 \ --tensor-parallel-size 1 \ --gpu_memory_utilization 0.90 \ --max_num_seqs 32 \ --served_model_name qwen \ --dtype float16 \ --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \ --quantization ascend \ --max-model-len 16384 \ --no-enable-prefix-caching \ --load_format="sharded_state"
Qwen3-14B-w8a8sc example
vllm serve /your-save-path/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1/ \ --host 127.0.0.1 \ --port 8080 \ --tensor-parallel-size 1 \ --gpu_memory_utilization 0.90 \ --max_num_seqs 16 \ --served_model_name qwen \ --dtype float16 \ --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \ --quantization ascend \ --max-model-len 16384 \ --no-enable-prefix-caching \ --load_format="sharded_state"
Qwen3-32B-w8a8sc example
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 vllm serve /save-path/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4/ \ --host 127.0.0.1 \ --port 8080 \ --tensor-parallel-size 4 \ --gpu_memory_utilization 0.90 \ --max_num_seqs 32 \ --served_model_name qwen \ --dtype float16 \ --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \ --quantization ascend \ --max-model-len 20480 \ --no-enable-prefix-caching \ --load_format="sharded_state"
Closing notes
For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.