Qwen3-Omni-30B-A3B-Thinking

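1 Introduction

Qwen3-Omni is a natively end-to-end, multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B contains the thinker component, is equipped with chain-of-thought reasoning, accepts audio, video, and text input, and outputs text.

This document walks through the main validation steps for the model: supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.

The Qwen3-Omni-30B-A3B model was first supported in v0.12.0rc1. This document was validated and written against vLLM-Ascend v0.22.1rc. All versions from v0.22.1rc onward run the model stably. To use the latest features, use the latest release candidate or official release.

2 Supported Features

Refer to the supported features list for the feature matrix of this model.

Refer to the feature guide for how to configure each feature.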

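3 Prerequisites

3.1 Model Weights

The following model variants are available. We recommend downloading the model weights to a shared directory that every node can access.

Model	Hardware requirements	Download link
Qwen3-Omni-30B-A3B (BF16)	Atlas 800I A3 (64G, 1-2 cards); Atlas 800I A2 (64G, 2-4 cards)	Download
Qwen3-Omni-30B-A3B-W8A8	Atlas 800I A3 (64G, 1-2 cards); Atlas 800I A2 (64G, 2-4 cards)	N/A

The W8A8 quantized weights cannot be downloaded directly; you can obtain them by quantizing the BF16 model with msmodelslim. See the quantization guide for details. Replace every model path in this document with your actual local path.

The card counts above are recommendations and can be adjusted to your situation.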

Note

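Qwen3-Omni-30B-A3B-W8A8 uses a mixed quantization strategy (listed in model-structure order):

Embedding layer: BF16 (not quantized)
Q/K normalization (q_norm, k_norm): BF16
Attention projections (q/k/v/o_proj): static W8A8 with precomputed per-tensor scaling factors
MoE router gate (mlp.gate): BF16
MoE expert projections (gate/up/down_proj): dynamic W8A8, with input scaling factors computed on the fly during inference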
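
As a rough illustration of the difference between the static and dynamic W8A8 modes listed above, the sketch below quantizes the same activations twice: once with a scale fixed ahead of time and once with a scale derived from the tensor at run time. It is plain Python, not the msmodelslim or vLLM-Ascend implementation; the function names, the `STATIC_SCALE` value, and the per-tensor granularity of the dynamic path are assumptions made for illustration.

```python
def quantize_int8(values, scale):
    """Map floats to int8 using a symmetric scale, clamping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dynamic_scale(values):
    """Derive the scale from the tensor actually seen at inference time."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0


# Static path: the scale is fixed before serving (hypothetical calibration value).
STATIC_SCALE = 0.05

activations = [0.5, -1.0, 2.54]

# Same input, two scales: the static one ignores the current value range,
# the dynamic one stretches the current range over the full int8 span.
static_q = quantize_int8(activations, STATIC_SCALE)
dynamic_q = quantize_int8(activations, dynamic_scale(activations))
```

The dynamic path costs an extra reduction per call but adapts to each input, which is one plausible reason to reserve it for the MoE expert projections while the attention projections reuse a precomputed scale.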

4 Installation

4.1 Docker Image Installation

You can use the official all-in-one Docker image for the Qwen3-Omni MoE model.

Pull the image:

docker pull quay.io/ascend/vllm-ascend:v0.22.1rc1

Run the container:

Atlas 800I A3:

export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1

docker run \
    --name vllm-ascend-env \
    --shm-size=128g \
    --net=host \
    --privileged=true \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci8 \
    --device /dev/davinci9 \
    --device /dev/davinci10 \
    --device /dev/davinci11 \
    --device /dev/davinci12 \
    --device /dev/davinci13 \
    --device /dev/davinci14 \
    --device /dev/davinci15 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -it -d $IMAGE bash

Note

The A3 has 8 NPUs with a dual-die design (16 chips in total: /dev/davinci[0-15]). On a shared machine, map only the chips you need (for example, NPUs 0-3 map to /dev/davinci[0-7]).
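
To make the NPU-to-chip mapping in the note concrete, here is a minimal sketch that generates the `--device` arguments for a subset of NPUs. It assumes NPU n owns chips 2n and 2n+1, which is consistent with the example in the note; the helper names `davinci_devices` and `docker_device_flags` are hypothetical and not part of any Ascend tooling.

```python
def davinci_devices(npu_ids, dies_per_npu=2):
    """Return the /dev/davinci* paths backing the given NPU IDs.

    Assumption: NPU n owns chips n*dies_per_npu through
    n*dies_per_npu + dies_per_npu - 1 (NPUs 0-3 -> /dev/davinci[0-7]).
    """
    paths = []
    for npu in sorted(set(npu_ids)):
        first = npu * dies_per_npu
        paths.extend(f"/dev/davinci{chip}" for chip in range(first, first + dies_per_npu))
    return paths


def docker_device_flags(npu_ids, dies_per_npu=2):
    """Format the chip paths as `--device` arguments for `docker run`."""
    return [f"--device {path}" for path in davinci_devices(npu_ids, dies_per_npu)]


# NPUs 0-3 on an A3 expand to chips 0-7.
flags = docker_device_flags([0, 1, 2, 3])
```

Passing `dies_per_npu=1` gives a one-to-one mapping between NPU IDs and device files.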

Atlas 800I A2:

export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1

docker run \
    --name vllm-ascend-env \
    --shm-size=128g \
    --net=host \
    --privileged=true \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -it -d $IMAGE bash

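The default working directory is /workspace. vLLM and vLLM-Ascend are installed as Python packages in site-packages.

Verify the installation:

After starting the container, run the following command to verify the installation:

docker ps | grep vllm-ascend-env

Expected result: the container is listed with status Up. You can also check the vllm-ascend version inside the container:

pip show vllm-ascend

Expected result: version information is shown and matches the version of the pulled image.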

4.2 Installing from Source

If you prefer not to use the Docker image, you can build from source. Install vLLM from source first.

Clone and install vLLM:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

Clone and install the vLLM-Ascend repository:

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .

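Verify the installation:

pip show vllm vllm-ascend

Expected result: version information is shown for both packages, confirming that the installation succeeded.

Note

For a multi-node deployment, set up the environment on every node.

Install the required dependencies: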

pip install qwen_omni_utils modelscope
# Used for audio processing.
apt-get update && apt-get install -y ffmpeg
# Check the installation.
ffmpeg -version

Set the following variable to avoid HcclAllreduce failures caused by the stream and shape limits of the default FFTS+ mode.

export HCCL_OP_EXPANSION_MODE="AIV"

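5 Online Serving Deployment

Note: Because this model has relatively few parameters, prefill-decode (PD) disaggregation is not covered.

5.1 Single-Node Online Deployment

Single-node deployment runs both prefill and decode on the same node and suits development, testing, and small-to-medium-scale inference. For the Qwen3-Omni-30B-A3B MoE model, use expert parallelism (EP) to distribute the experts across multiple NPUs.

The following command is an example configuration. Adjust the parameters to your actual scenario.

Atlas 800I A2/A3: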

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export HCCL_OP_EXPANSION_MODE="AIV"  # not needed on A2
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve your_model_path \
    --served-model-name qwen3-omni \
    --trust-remote-code \
    --max-num-seqs 100 \
    --max-model-len 40960 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --quantization ascend \
    --distributed_executor_backend "mp" \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --gpu-memory-utilization 0.95 \
    --additional-config '{"enable_flashcomm1": false, "weight_nz_mode": 2}' \
    --port 8000

Note

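ASCEND_RT_VISIBLE_DEVICES: must be set to the NPU chip IDs allocated to your environment (for example, 0,1,2,3 for 4 chips).
--port: adjust the port to avoid conflicts with other services running on the same machine.
--no-enable-prefix-caching: prefix caching is disabled in this example because its effect for this model on Ascend NPUs has not been fully validated. You can try enabling it to evaluate the cache hit rate of your workload.
--quantization ascend: required for the W8A8 quantized model. Remove this flag when using BF16 weights.

Tip

For parameter details, refer to:

vLLM CLI documentation: standard serving parameters (--host, --port, --max-model-len, and so on)
Environment variables: Ascend-specific environment variables (HCCL_* and so on)
Additional configuration: --additional-config format and options

Service verification:

If the service starts successfully, the following startup log is displayed: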

(APIServer pid=<pid>) INFO:     Started server process [<pid>]
(APIServer pid=<pid>) INFO:     Waiting for application startup.
(APIServer pid=<pid>) INFO:     Application startup complete.
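
Besides watching the log, readiness can be polled over HTTP. The sketch below assumes the server exposes `GET /health` on the port chosen above, as vLLM's OpenAI-compatible server does; the helper names, retry count, and delay are illustrative choices rather than recommended values.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, attempts=60, delay_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A call such as `wait_until_ready("http://localhost:8000/health")` returns `True` once the server answers; `probe` and `sleep` are injectable so the loop can be exercised without a running server.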

6 Functional Verification

Once the server is running, you can query the model with an input prompt.

curl http://localhost:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-omni",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                    }
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                    }
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                    }
                },
                {
                    "type": "text",
                    "text":  "Analyze this audio, image, and video together."
                }
            ]
        }
    ]
}'
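
The same request can be issued from Python using only the standard library. This is a minimal sketch: `build_payload` and `chat` are hypothetical helper names, the endpoint and served model name are taken from the serve command in Section 5, and the 600-second timeout is an arbitrary allowance for a long reasoning response.

```python
import json
import urllib.request

BASE = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo"


def build_payload(prompt, image=None, audio=None, video=None, model="qwen3-omni"):
    """Assemble a chat-completions body with optional media parts before the text."""
    content = []
    for kind, url in (("image_url", image), ("audio_url", audio), ("video_url", video)):
        if url:
            content.append({"type": kind, kind: {"url": url}})
    content.append({"type": "text", "text": prompt})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(payload, endpoint="http://localhost:8000/v1/chat/completions", timeout_s=600):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


# Same three media inputs and prompt as the curl example.
payload = build_payload(
    "Analyze this audio, image, and video together.",
    image=f"{BASE}/cars.jpg",
    audio=f"{BASE}/cough.wav",
    video=f"{BASE}/draw.mp4",
)
```

With the server running, `chat(payload)` returns the response as a dictionary.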

The output should look similar to the following.

{
  "id": "chatcmpl-8f3a1c2e4b5d4f7a9c0e1d2b3a4f5e6d",
  "object": "chat.completion",
  "created": 1782714240,
  "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Let me examine each modality. The image shows several cars parked in a row — looks like a street or lot, daytime. The audio is short and sharp, sounds like a human cough (a single dry cough). The video appears to show someone drawing/sketching on paper, a hand moving a pen. Now I need to connect them: there isn't a single obvious narrative linking cars + cough + drawing, so I should describe each and then note any plausible shared context (e.g., a person sketching cars while having a cold). I'll keep the synthesis honest rather than forcing a connection.",
        "content": "Here's an analysis of the three inputs together:\n\n**Image** — A row of parked cars photographed outdoors in daylight, likely on a street or in a parking area. The vehicles are of varying colors and models, shot at a slight angle so several are visible in perspective.\n\n**Audio** — A brief, dry human cough: a single forceful exhalation lasting under a second, with no speech or background music. It sounds like an isolated cough rather than a sustained coughing fit.\n\n**Video** — A close-up of a hand drawing/sketching with a pen on paper, the strokes building up a simple illustration over a few seconds.\n\n**Combined interpretation** — The three clips don't share an explicit storyline; they're distinct samples of vision, sound, and motion. If a connecting context is assumed, one plausible scene is a person sketching cars (the drawing in the video, the cars in the image) while momentarily coughing (the audio) — e.g., an artist working outdoors who has a cold. But strictly, each input stands on its own: a static photo of cars, a one-off cough sound, and a short hand-drawing clip.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 8423,
    "total_tokens": 8712,
    "completion_tokens": 289,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Expected result: HTTP 200 and a JSON response whose choices field contains the generated text.
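
A small sketch of how a client might separate the reasoning trace from the final answer and sanity-check the token counters. It assumes the response carries a `reasoning_content` field as in the sample above, which may depend on the server version and reasoning parser; `summarize` is a hypothetical helper and the sample dictionary is a shortened copy of that response.

```python
def summarize(response):
    """Pull the reasoning, answer, and token usage out of a chat completion."""
    choice = response["choices"][0]
    message = choice["message"]
    usage = response["usage"]
    # The three counters should be consistent with each other.
    if usage["prompt_tokens"] + usage["completion_tokens"] != usage["total_tokens"]:
        raise ValueError("usage counters do not add up")
    return {
        "reasoning": message.get("reasoning_content") or "",
        "answer": message["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage["total_tokens"],
    }


# Trimmed copy of the sample response (texts shortened for brevity).
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "Let me examine each modality.",
                "content": "Here's an analysis of the three inputs together:",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 8423, "total_tokens": 8712, "completion_tokens": 289},
}

result = summarize(sample)
```

A `finish_reason` other than `stop` (for example `length`) would suggest the answer was cut off and that the completion token limit needs raising.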

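7 Accuracy Evaluation

Using EvalScope

Taking the gsm8k, omnibench, and bbh datasets as examples, run an accuracy evaluation of Qwen3-Omni-30B-A3B-Thinking in online mode.

Refer to Using EvalScope to install evalscope.

Run evalscope to perform the accuracy evaluation.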

evalscope eval \
    --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
    --api-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --eval-type server \
    --datasets omni_bench gsm8k bbh \
    --dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
    --eval-batch-size 1 \
    --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
    --limit 100

After the run completes you get the results. The following results for Qwen3-Omni-30B-A3B-Thinking on vllm-ascend:0.13.0rc1 are for reference only.

+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model                       | Dataset    | Metric   | Subset   |   Num |   Score | Cat.0   |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default  |   100 |    0.44 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k      | mean_acc | main     |   100 |    0.98 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh        | mean_acc | OVERALL  |   270 |  0.9148 |         |
+-----------------------------+------------+----------+----------+-------+---------+---------+

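8 Performance Evaluation

Using vLLM Benchmark

Run a performance evaluation using Qwen3-Omni-30B-A3B-Thinking as an example. For more details, refer to the vLLM benchmark documentation.

vllm bench has three subcommands:

latency: benchmarks the latency of a single batch of requests.
serve: benchmarks online serving throughput.
throughput: benchmarks offline inference throughput.

Taking serve as an example, run the following commands.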

export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt

vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./

After the run completes you get the results. The following results for Qwen3-Omni-30B-A3B-Thinking on vllm-ascend:0.13.0rc1 are for reference only.

============ Serving Benchmark Result ============
Successful requests:                     200
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  211.90
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              0.94
Output token throughput (tok/s):         120.81
Peak output token throughput (tok/s):    216.00
Peak concurrent requests:                24.00
Total token throughput (tok/s):          309.58
---------------Time to First Token----------------
Mean TTFT (ms):                          215.50
Median TTFT (ms):                        211.51
P99 TTFT (ms):                           317.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          98.96
Median TPOT (ms):                        99.19
P99 TPOT (ms):                           101.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           99.02
Median ITL (ms):                         96.10
P99 ITL (ms):                            176.02
==================================================
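
The throughput lines in this report follow directly from the counters above them. The short calculation below reproduces them from the reported duration and token totals; the variable names are ours, and the numbers are copied from the report.

```python
# Counters copied from the benchmark report.
duration_s = 211.90
successful_requests = 200
input_tokens = 40000
output_tokens = 25600

# Each throughput figure is a total divided by the wall-clock duration.
request_throughput = successful_requests / duration_s
output_throughput = output_tokens / duration_s
total_throughput = (input_tokens + output_tokens) / duration_s

# Per-request averages implied by the totals.
avg_input_tokens = input_tokens / successful_requests
avg_output_tokens = output_tokens / successful_requests

print(f"request throughput: {request_throughput:.2f} req/s")
print(f"output throughput:  {output_throughput:.2f} tok/s")
print(f"total throughput:   {total_throughput:.2f} tok/s")
```

The average input length of 200 tokens matches `--random-input 200`, and the request throughput of 0.94 req/s sits just under the configured rate of 1 req/s because the run includes the time needed to drain the last in-flight requests.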

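9 Performance Tuning

9.1 Recommended Configurations

Note: The following configurations were validated in a specific test environment and are for reference only. The best configuration depends on factors such as the maximum input/output length, prefix cache hit rate, accuracy requirements, and the ratio of deployed machines. Tune for your own workload by following Section 9.2.

Table 1: Scenario Overview

Scenario	Deployment mode	*Total NPUs	Weight version	Key considerations
High throughput	Single node (TP1)	1 (A3); 2 (A2)	W8A8	Single-card deployment maximizes concurrent request handling
Low latency	Single node (TP4)	2 (A3); 4 (A2)	W8A8	Multi-card TP with expert parallelism lowers per-token latency
Long context	Single node (TP4)	2 (A3); 4 (A2)	W8A8	Fewer concurrent sequences to support a longer max-model-len

*Total NPUs is the number of NPUs used across all nodes. On Atlas 800I A3, each NPU contains two dies (chips), so TP4 needs 4 chips = 2 NPUs.

Table 2: Detailed Node Configuration

Scenario	NPU count	TP	max-model-len	max-num-seqs	FUSED_MC2	EP	hf-overrides
High Throughput	1 (A3)	1	37364	100	Off	Off	-
Low Latency	2 (A3)	4	37364	100	Off	On	-
Long Context	2 (A3)	4	131072	14	Off	On	YaRN

For detailed parameter descriptions, refer to the deployment example in Section 5.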
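
The rows of Table 2 can be read as a lookup from scenario to serving flags. The sketch below encodes only the columns that differ between scenarios and turns them into a `vllm serve` argument list; `SCENARIOS` and `serve_args` are hypothetical names, and the remaining flags from the full commands in this section still need to be appended.

```python
# Values copied from Table 2 (A3 rows).
SCENARIOS = {
    "high_throughput": {"tp": 1, "max_model_len": 37364, "max_num_seqs": 100, "ep": False},
    "low_latency": {"tp": 4, "max_model_len": 37364, "max_num_seqs": 100, "ep": True},
    "long_context": {"tp": 4, "max_model_len": 131072, "max_num_seqs": 14, "ep": True},
}


def serve_args(scenario, model_path="your_model_path"):
    """Build the scenario-specific part of the `vllm serve` command line."""
    cfg = SCENARIOS[scenario]
    args = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(cfg["tp"]),
        "--max-model-len", str(cfg["max_model_len"]),
        "--max-num-seqs", str(cfg["max_num_seqs"]),
    ]
    # Expert parallelism is only switched on for the multi-chip scenarios.
    if cfg["ep"]:
        args.append("--enable-expert-parallel")
    return args


long_context_cmd = " ".join(serve_args("long_context"))
```

Keeping the differences in one table-like structure makes it easier to see that the long-context scenario trades concurrency (14 sequences instead of 100) for a longer context window.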

Low-latency configuration:

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve your_model_path \
    --served-model-name qwen3-omni \
    --trust-remote-code \
    --max-num-seqs 100 \
    --max-model-len 37364 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --distributed_executor_backend "mp" \
    --no-enable-prefix-caching \
    --async-scheduling \
    --quantization ascend \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_flashcomm1": false, "weight_nz_mode": 2}' \
    --gpu-memory-utilization 0.95 \
    --port 8000 \
    --speculative-config '{"method": "eagle3","model": "your_eagle3_model_path", "num_speculative_tokens": 3}'

Tip

Example AISBench settings for this configuration:

request_rate: 0
batch_size: 32
Input/output length: 2048/2048 or 3500/1500

High-throughput configuration:

export ASCEND_RT_VISIBLE_DEVICES=0
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve your_model_path \
    --served-model-name qwen3-omni \
    --trust-remote-code \
    --max-num-seqs 100 \
    --max-model-len 37364 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 1 \
    --distributed_executor_backend "mp" \
    --no-enable-prefix-caching \
    --async-scheduling \
    --quantization ascend \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config '{"weight_nz_mode": 2}' \
    --gpu-memory-utilization 0.95 \
    --port 8000 \
    --speculative-config '{"method": "eagle3","model": "your_eagle3_model_path", "num_speculative_tokens": 3}'

Tip

Example AISBench settings for this configuration:

request_rate: 0
batch_size: 32
Input/output length: 2048/2048 or 3500/1500

Long-context configuration:

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve your_model_path \
    --served-model-name qwen3-omni \
    --trust-remote-code \
    --max-num-seqs 14 \
    --max-model-len 131072 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --distributed_executor_backend "mp" \
    --no-enable-prefix-caching \
    --async-scheduling \
    --quantization ascend \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_flashcomm1": false, "weight_nz_mode": 2}' \
    --gpu-memory-utilization 0.95 \
    --port 8000 \
    --hf-overrides '{"rope_parameters": {"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}}'
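
The YaRN override and `--max-model-len` in this command are related by simple arithmetic. The check below assumes the usual convention that the extended window equals `factor` times `original_max_position_embeddings`; `yarn_window` is a hypothetical helper, and the JSON string is copied from the command.

```python
import json

# Copied from the --hf-overrides and --max-model-len flags in the command.
HF_OVERRIDES = '{"rope_parameters": {"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}}'
MAX_MODEL_LEN = 131072


def yarn_window(overrides_json):
    """Return the context length implied by a YaRN rope override."""
    rope = json.loads(overrides_json)["rope_parameters"]
    if rope["rope_type"] != "yarn":
        raise ValueError("not a YaRN override")
    return int(rope["factor"] * rope["original_max_position_embeddings"])


# 4 * 32768 = 131072, which matches --max-model-len.
window = yarn_window(HF_OVERRIDES)
fits = MAX_MODEL_LEN <= window
```

If a different `--max-model-len` is chosen, the same check indicates whether `factor` needs to change.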

Tip

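Example AISBench settings for this configuration:

request_rate: 0
batch_size: 32
Input/output length: 65536/1024 or 131072/1024

9.2 Tuning Guide

9.2.1 General Tuning References

Refer to the public performance tuning documentation for tuning methods. Refer to the feature matrix for detailed feature descriptions.

10 FAQ

For common environment, installation, and general parameter issues, refer to the public FAQ.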