Atlas 300I DUO#

Running vLLM on Atlas 300I DUO#

Notes#

  • The current release supports FULL_DECODE_ONLY graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:

    • When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.

    • There is no such limitation when TP=1.

    • We have reached out to the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.

  • Atlas 300I DUO does not support triton or triton-ascend.

  • If installing from source, vllm and vllm-ascend will automatically pull in triton and triton-ascend dependencies, which may cause unexpected issues on Atlas 300I DUO. Please run:

pip uninstall -y triton triton-ascend
# If you still encounter errors mentioning triton, manually remove the remaining triton directory in site-packages,
# as uninstalling triton may leave residual files behind.
# For example: rm -rf /usr/local/python3.11.10/lib/python3.11/site-packages/triton

Deployment#

警告

For Atlas 300I DUO (310P), do not rely on automatic max-model-len detection (that is, do not omit the --max-model-len argument), or OOM may occur.

原因(当前 310P 注意力路径):

  • AscendAttentionMetadataBuilder310model_config.max_model_len 传递给 AttentionMaskBuilder310

  • AttentionMaskBuilder310 builds a full float16 causal mask with shape [max_model_len, max_model_len], and then converts it to FRACTAL_NZ format.

  • In the 310P attention_v1 prefill/chunked-prefill path (_npu_flash_attention / _npu_paged_attention_splitfuse), this explicit mask tensor is used directly, and there is currently no compressed-mask path.

If automatic parsing resolves to a large context length, allocating this mask (O(max_model_len^2)) may exceed NPU memory and trigger OOM. Be sure to set an explicit and conservative value, such as --max-model-len 16384.

Run the Docker container:

# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0-310p
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash

Note#

The high performance is implemented based on the latest CANN community edition and the new PTA version. Therefore, you need to manually replace the CANN version with CANN 9.0.0 and torch_npu.The following uses Ubuntu as an example to describe how to install CANN. For details, see the following steps: Procedure Install the new PTA version. pip install torch_npu==2.9.0.post2 This step will be supported in v0.18.0.post and later versions. You can ignore it when it is supported.

Run the following steps to start the vLLM service on NPU for the Qwen3 Dense series:

  • Prepare the environment

    • Obtain model weights (W8A8SC weights will be uploaded to the Eco-Tech official ModelScope repository later.)

      Note: if you want to validate directly with w8a8s weights instead of w8a8sc weights, the following example shows the serving command for Qwen3-8B-w8a8s-310. Performance is slightly lower than with compressed w8a8sc weights. Detailed w8a8sc testing is covered in the following sections.

      vllm serve Eco-Tech/Qwen3-8B-w8a8s-310 --host 127.0.0.1 --port 8080 \
          --tensor-parallel-size 1 --gpu_memory_utilization 0.90 \
          --served_model_name qwen --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
          --quantization ascend --max-model-len 16384
      # `--load_format` is required only for the W8A8SC quantized weight format.
      #
      
    • Compress the weights

      • Uninstall triton (unsupported on 310P):

        pip uninstall triton
        pip uninstall triton-ascend
        
      • Get the compression script:

      • Install the compression tool

      • Compression command

        export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
        export LD_LIBRARY_PATH=/usr/local/python3.11.10/lib/:$LD_LIBRARY_PATH
        
        python save_sharded_state_310.py \
            --model /your-load-path/w8a8s-weight \
            --tensor-parallel-size 1 \
            --output /your-save-path/w8a8sc-weight \
            --enable-compress \
            --compress-process-num 4 \
            --enforce-eager \
            --dtype float16 \
            --quantization ascend \
            --max-model-len 10240
        

        Argument notes: --tensor-parallel-size: W8A8SC quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression. --model is the path to the input w8a8s weights, and --output is the output path for the compressed w8a8sc weights.

      • Additional notes

        • The Qwen3-8B model has fewer parameters, so some layers need fallback handling during quantization. It is recommended to download the qwen3-8B-w8a8sc weights directly from the Eco-Tech official ModelScope repository once available.

  • Examples

    • Qwen3-8B-w8a8sc example

      vllm serve /your-save-path/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1/ \
          --host 127.0.0.1 \
          --port 8080 \
          --tensor-parallel-size 1 \
          --gpu_memory_utilization 0.90 \
          --max_num_seqs 32 \
          --served_model_name qwen \
          --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
          --quantization ascend \
          --max-model-len 16384 \
          --no-enable-prefix-caching \
          --load_format="sharded_state"
      
    • Qwen3-14B-w8a8sc example

      vllm serve /your-save-path/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1/ \
          --host 127.0.0.1 \
          --port 8080 \
          --tensor-parallel-size 1 \
          --gpu_memory_utilization 0.90 \
          --max_num_seqs 16 \
          --served_model_name qwen \
          --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
          --quantization ascend \
          --max-model-len 16384 \
          --no-enable-prefix-caching \
          --load_format="sharded_state"
      
    • Qwen3-32B-w8a8sc example

      export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
      
      vllm serve /save-path/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4/ \
          --host 127.0.0.1 \
          --port 8080 \
          --tensor-parallel-size 4 \
          --gpu_memory_utilization 0.90 \
          --max_num_seqs 32 \
          --served_model_name qwen \
          --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
          --quantization ascend \
          --max-model-len 20480 \
          --no-enable-prefix-caching \
          --load_format="sharded_state"
      
  • Closing notes

    For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.