Atlas 300I DUO#

Note

Atlas 300I DUO does not support triton or triton-ascend.

Run vLLM on Atlas 300I DUO#

Install Notes#

When installed from source, vllm and vllm-ascend may automatically pull in triton and triton-ascend as dependencies, which can cause unexpected issues on Atlas 300I DUO. Uninstall both packages before running:

pip uninstall -y triton-ascend triton
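To confirm that the uninstall took effect, the installed distributions can be inspected from Python. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_unsupported` are illustrative and not part of vllm or vllm-ascend.

```python
from importlib import metadata

# Distributions that this guide says are unsupported on Atlas 300I DUO.
UNSUPPORTED = ("triton", "triton-ascend")


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_unsupported(names=UNSUPPORTED):
    """Map each listed distribution that is still installed to its version."""
    found = {}
    for name in names:
        version = installed_version(name)
        if version is not None:
            found[name] = version
    return found


if __name__ == "__main__":
    leftovers = find_unsupported()
    if leftovers:
        print("Still installed, uninstall before serving:", leftovers)
    else:
        print("No triton packages found.")
```

Run it inside the same Python environment that will launch vLLM, since the check only sees packages installed in the current interpreter.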

Graph Mode Notes#

Warning

The current release supports FULL_DECODE_ONLY graph mode on Atlas 300I DUO devices, with the following limitations caused by hardware event-id resource constraints:

  • When the Tensor Parallel (TP) size is greater than 1, the number of graphs that can be captured is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.

  • There is no such limitation when TP=1.

  • We have asked the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.

Deployment#

Run the Docker container:

# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0-310p

docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8080:8080 \
-it $IMAGE bash

Set up environment variables:

export VLLM_USE_MODELSCOPE=True

Online Inference on NPU#

Warning

On Atlas 300I DUO (310P), always pass the --max-model-len argument explicitly. Omitting it and relying on automatic detection of the maximum model length may cause out-of-memory (OOM) errors.

The reason lies in the current 310P attention path:

  • AscendAttentionMetadataBuilder310 passes model_config.max_model_len to AttentionMaskBuilder310.

  • AttentionMaskBuilder310 builds a full causal mask with shape [max_model_len, max_model_len] in float16, then converts it to FRACTAL_NZ.

  • In the 310P attention_v1 prefill/chunked-prefill path (_npu_flash_attention / _npu_paged_attention_splitfuse), this explicit mask tensor is consumed directly, and there is currently no compressed-mask path.

If auto detection resolves to a large context length, the mask allocation (O(max_model_len^2)) may exceed NPU memory and trigger OOM. Always set an explicit and conservative value, for example --max-model-len 16384.
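The quadratic growth is easy to quantify. The sketch below computes the size of a dense float16 mask of shape [max_model_len, max_model_len], as described in the list above. It is a lower bound under that assumption only: it ignores any padding or temporary copy introduced by the FRACTAL_NZ conversion, and the function names are illustrative.

```python
FLOAT16_BYTES = 2  # the mask is described as float16 in the text above


def causal_mask_bytes(max_model_len):
    """Bytes needed for a dense [max_model_len, max_model_len] float16 mask."""
    return max_model_len * max_model_len * FLOAT16_BYTES


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


for length in (16384, 20480, 32768, 40960, 131072):
    size_mib = to_mib(causal_mask_bytes(length))
    print(f"max_model_len={length:>6}: {size_mib:>8.0f} MiB")
```

With these numbers, 16384 costs 512 MiB and 20480 costs 800 MiB per mask, while a context length of 131072 would need 32 GiB for the mask alone.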

The commands in the following subsections start the vLLM server on NPU for the Qwen3 dense models.

Prepare Model Weights#

Use the W8A8SC quantized weights from the Eco-Tech official ModelScope repository.

Model                   ModelScope Link
Qwen3-8B-W8A8SC-310     Eco-Tech/Qwen3-8B-w8a8sc-310-vllm
Qwen3-14B-W8A8SC-310    Eco-Tech/Qwen3-14B-w8a8sc-310-vllm
Qwen3-32B-W8A8SC-310    Eco-Tech/Qwen3-32B-w8a8sc-310-vllm
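The model paths used in the serve commands of this guide extend the repository name with a TP-specific subdirectory. The helper below reproduces that naming pattern. The pattern is inferred from the three paths shown in this guide, and `weights_path` is a hypothetical name; a directory for any other size or TP combination may not exist in the repository.

```python
def weights_path(size, tp):
    """Build the model path for a size such as "8B" and a tensor parallel degree.

    The layout is inferred from the example paths in this guide:
    <repo>/TP<tp>/<repo name>-tp<tp>
    """
    repo = f"Eco-Tech/Qwen3-{size}-w8a8sc-310-vllm"
    shard_dir = f"Qwen3-{size}-w8a8sc-310-vllm-tp{tp}"
    return f"{repo}/TP{tp}/{shard_dir}"


# The three combinations that appear in this guide.
for size, tp in (("8B", 1), ("14B", 1), ("32B", 4)):
    print(weights_path(size, tp))
```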

Qwen3-8B-W8A8SC#

vllm serve Eco-Tech/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1 \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 1 \
    --gpu_memory_utilization 0.90 \
    --max_num_seqs 32 \
    --served_model_name qwen \
    --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
    --quantization ascend \
    --max_model_len 16384 \
    --no-enable-prefix-caching \
    --load_format sharded_state

Qwen3-14B-W8A8SC#

vllm serve Eco-Tech/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1 \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 1 \
    --gpu_memory_utilization 0.90 \
    --max_num_seqs 16 \
    --served_model_name qwen \
    --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
    --quantization ascend \
    --max_model_len 16384 \
    --no-enable-prefix-caching \
    --load_format sharded_state

Qwen3-32B-W8A8SC#

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

vllm serve Eco-Tech/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4 \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 4 \
    --gpu_memory_utilization 0.90 \
    --max_num_seqs 32 \
    --served_model_name qwen \
    --dtype float16 \
    --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
    --quantization ascend \
    --max_model_len 20480 \
    --no-enable-prefix-caching \
    --load_format sharded_state
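The three commands above share most of their flags and differ only in the model path, tensor parallel size, max_num_seqs, max_model_len, and graph capture sizes. The sketch below collects those per-model values in one table and assembles the argument list from it. All values are copied from the commands above; `MODEL_SETTINGS` and `serve_args` are illustrative names, not part of vLLM.

```python
import json
import shlex

# Per-model values copied from the three commands above; all other flags are shared.
MODEL_SETTINGS = {
    "8B": {"tp": 1, "max_num_seqs": 32, "max_model_len": 16384,
           "capture_sizes": [1, 2, 4, 8, 16, 32]},
    "14B": {"tp": 1, "max_num_seqs": 16, "max_model_len": 16384,
            "capture_sizes": [1, 2, 4, 8, 16]},
    "32B": {"tp": 4, "max_num_seqs": 32, "max_model_len": 20480,
            "capture_sizes": [16, 32]},
}


def serve_args(size):
    """Return the `vllm serve` argument list for one of the models above."""
    s = MODEL_SETTINGS[size]
    model = (f"Eco-Tech/Qwen3-{size}-w8a8sc-310-vllm/TP{s['tp']}/"
             f"Qwen3-{size}-w8a8sc-310-vllm-tp{s['tp']}")
    compilation = {"cudagraph_mode": "FULL_DECODE_ONLY",
                   "cudagraph_capture_sizes": s["capture_sizes"]}
    additional = {"ascend_compilation_config": {"fuse_norm_quant": False}}
    return [
        "vllm", "serve", model,
        "--host", "127.0.0.1",
        "--port", "8080",
        "--tensor-parallel-size", str(s["tp"]),
        "--gpu_memory_utilization", "0.90",
        "--max_num_seqs", str(s["max_num_seqs"]),
        "--served_model_name", "qwen",
        "--dtype", "float16",
        "--additional-config", json.dumps(additional),
        "--compilation-config", json.dumps(compilation),
        "--quantization", "ascend",
        "--max_model_len", str(s["max_model_len"]),
        "--no-enable-prefix-caching",
        "--load_format", "sharded_state",
    ]


# Print a shell-quoted command line for the 32B model.
print(shlex.join(serve_args("32B")))
```

The 32B command additionally requires ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 in the environment, as shown above. In all three configurations the largest capture size equals max_num_seqs.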

Once the server is started, you can query the model with input prompts:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "prompt": "The future of AI is",
    "max_tokens": 64,
    "temperature": 0.0
  }'

If the request succeeds, the response contains the generated text.
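The same request can be sent from Python without extra dependencies. This is a minimal sketch using the standard library, assuming the server listens on port 8080 and was started with --served_model_name qwen as in the commands above. It uses `max_tokens`, the length field defined for the OpenAI-compatible completions endpoint. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches --port 8080 in the serve commands


def build_completion_request(prompt, model="qwen", max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=60):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Once the server is running, `print(complete("The future of AI is"))` prints the generated continuation.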

Offline Inference#

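Save one of the following scripts as example.py to run offline inference on NPU. There is one script per model; they differ in the model path and a few model-specific settings.

Qwen3-8B-W8A8SC: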

import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)

llm = LLM(
    model="Eco-Tech/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1",
    tensor_parallel_size=1,
    max_model_len=16384,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()

Qwen3-14B-W8A8SC:

import gc
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)

llm = LLM(
    model="Eco-Tech/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1",
    tensor_parallel_size=1,
    max_model_len=16384,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [1, 2, 4, 8, 16],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()

Qwen3-32B-W8A8SC:

import gc
import os
import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1,2,3"

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.0,
)

llm = LLM(
    model="Eco-Tech/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4",
    tensor_parallel_size=4,
    max_model_len=20480,
    dtype="float16",
    quantization="ascend",
    load_format="sharded_state",
    additional_config={
        "ascend_compilation_config": {
            "fuse_norm_quant": False,
        }
    },
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [16, 32],
    },
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()

Run script:

python example.py

If the script runs successfully, it prints each prompt followed by its generated text.

Closing Notes#

For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.