# Atlas 300I DUO

## Running vLLM on Atlas 300I DUO

### Notes

* The current release supports `FULL_DECODE_ONLY` graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:

    * When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.
    * There is no such limitation when TP=1.
    * We have reached out to the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.

* Atlas 300I DUO does not support `triton` or `triton-ascend`.

* If installing from source, `vllm` and `vllm-ascend` will automatically pull in `triton` and `triton-ascend` dependencies, which may cause unexpected issues on Atlas 300I DUO. Please run:

```bash
pip uninstall -y triton triton-ascend
# If you still encounter errors mentioning triton, manually remove the remaining triton directory in site-packages,
# as uninstalling triton may leave residual files behind.
# For example: rm -rf /usr/local/python3.11.10/lib/python3.11/site-packages/triton
```

### Deployment

```{warning}
For Atlas 300I DUO (310P), do not rely on automatic `max-model-len` detection
(that is, do not omit the `--max-model-len` argument), or OOM may occur.

Reason (current 310P attention path):
- `AscendAttentionMetadataBuilder310` passes `model_config.max_model_len`
  to `AttentionMaskBuilder310`.
- `AttentionMaskBuilder310` builds a full float16 causal mask with shape
  `[max_model_len, max_model_len]`,
  and then converts it to FRACTAL_NZ format.
- In the 310P `attention_v1` prefill/chunked-prefill path
  (`_npu_flash_attention` / `_npu_paged_attention_splitfuse`),
  this explicit mask tensor is used directly, and there is currently
  no compressed-mask path.

If automatic parsing resolves to a large context length, allocating this mask
(`O(max_model_len^2)`) may exceed NPU memory and trigger OOM.
Be sure to set an explicit and conservative value, such as `--max-model-len 16384`.
```

Run the Docker container:

```{code-block} bash
   :substitutions:

# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0-310p
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

### Note

The high performance is implemented based on the latest CANN community edition and the new PTA version. Therefore, you need to manually replace the CANN version with CANN 9.0.0 and torch_npu.The following uses Ubuntu as an example to describe how to install CANN. For details, see the following steps:
[Procedure](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/900/softwareinst/instg/instg_0090.html?OS=Ubuntu&InstallType=localpack) Install the new PTA version. pip install torch_npu==2.9.0.post2 This step will be supported in v0.18.0.post and later versions. You can ignore it when it is supported.

Run the following steps to start the vLLM service on NPU for the Qwen3 Dense series:

* Prepare the environment

    * Obtain model weights
        (`W8A8SC` weights will be uploaded to the Eco-Tech official ModelScope repository later.)

        * This guide requires `W8A8SC` quantized weights for the Qwen3 Dense `8B/14B/32B` models. You need to generate the SC-compressed weights yourself.
        * First, prepare the `W8A8S` weights:

            * Qwen3-8B-w8a8s-310: [https://modelers.cn/models/Eco-Tech/Qwen3-8B-w8a8s-310](https://modelers.cn/models/Eco-Tech/Qwen3-8B-w8a8s-310)
            * Qwen3-14B-w8a8s-310: [https://modelers.cn/models/Eco-Tech/Qwen3-14B-w8a8s-310](https://modelers.cn/models/Eco-Tech/Qwen3-14B-w8a8s-310)
            * Qwen3-32B-w8a8s-310: [https://modelers.cn/models/Eco-Tech/Qwen3-32B-w8a8s-310](https://modelers.cn/models/Eco-Tech/Qwen3-32B-w8a8s-310)

        Note: if you want to validate directly with `w8a8s` weights instead of `w8a8sc` weights, the following example shows the serving command for `Qwen3-8B-w8a8s-310`. Performance is slightly lower than with compressed `w8a8sc` weights. Detailed `w8a8sc` testing is covered in the following sections.

        ```bash
        vllm serve Eco-Tech/Qwen3-8B-w8a8s-310 --host 127.0.0.1 --port 8080 \
            --tensor-parallel-size 1 --gpu_memory_utilization 0.90 \
            --served_model_name qwen --dtype float16 \
            --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
            --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
            --quantization ascend --max-model-len 16384
        # `--load_format` is required only for the W8A8SC quantized weight format.
        #
        ```

    * Compress the weights

        * Uninstall triton (unsupported on 310P):

            ```bash
            pip uninstall triton
            pip uninstall triton-ascend
            ```

        * Get the compression script:

            * [https://github.com/vllm-project/vllm-ascend/blob/main/examples/save_sharded_state_310.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/save_sharded_state_310.py)

        * Install the compression tool

            * Repository: [https://gitcode.com/Ascend/msit.git](https://gitcode.com/Ascend/msit.git)
            * Installation guide: [https://gitcode.com/Ascend/msit/blob/master/msmodelslim/docs/安装指南.md#基于atlas-300i-duo-系列产品安装](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/docs/安装指南.md#基于atlas-300i-duo-系列产品安装)

        * Compression command

            ```bash
            export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
            export LD_LIBRARY_PATH=/usr/local/python3.11.10/lib/:$LD_LIBRARY_PATH

            python save_sharded_state_310.py \
                --model /your-load-path/w8a8s-weight \
                --tensor-parallel-size 1 \
                --output /your-save-path/w8a8sc-weight \
                --enable-compress \
                --compress-process-num 4 \
                --enforce-eager \
                --dtype float16 \
                --quantization ascend \
                --max-model-len 10240
            ```

            Argument notes: `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression. `--model` is the path to the input `w8a8s` weights, and `--output` is the output path for the compressed `w8a8sc` weights.

        * Additional notes

            * The Qwen3-8B model has fewer parameters, so some layers need fallback handling during quantization. It is recommended to download the `qwen3-8B-w8a8sc` weights directly from the Eco-Tech official ModelScope repository once available.

* Examples

    * Qwen3-8B-w8a8sc example

        ```bash
        vllm serve /your-save-path/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1/ \
            --host 127.0.0.1 \
            --port 8080 \
            --tensor-parallel-size 1 \
            --gpu_memory_utilization 0.90 \
            --max_num_seqs 32 \
            --served_model_name qwen \
            --dtype float16 \
            --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
            --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
            --quantization ascend \
            --max-model-len 16384 \
            --no-enable-prefix-caching \
            --load_format="sharded_state"
        ```

    * Qwen3-14B-w8a8sc example

        ```bash
        vllm serve /your-save-path/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1/ \
            --host 127.0.0.1 \
            --port 8080 \
            --tensor-parallel-size 1 \
            --gpu_memory_utilization 0.90 \
            --max_num_seqs 16 \
            --served_model_name qwen \
            --dtype float16 \
            --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
            --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
            --quantization ascend \
            --max-model-len 16384 \
            --no-enable-prefix-caching \
            --load_format="sharded_state"
        ```

    * Qwen3-32B-w8a8sc example

        ```bash
        export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

        vllm serve /save-path/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4/ \
            --host 127.0.0.1 \
            --port 8080 \
            --tensor-parallel-size 4 \
            --gpu_memory_utilization 0.90 \
            --max_num_seqs 32 \
            --served_model_name qwen \
            --dtype float16 \
            --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
            --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
            --quantization ascend \
            --max-model-len 20480 \
            --no-enable-prefix-caching \
            --load_format="sharded_state"
        ```

* Closing notes

  For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.
