Qwen3-Omni-30B-A3B-Thinking#

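Introduction#

Qwen3-Omni is a natively end-to-end, multilingual, omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses as both text and natural speech. Several architectural upgrades were introduced to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B contains the thinker component and supports chain-of-thought reasoning; it accepts audio, video, and text input and produces text output.

This document walks through the main validation steps for the model: supported features, feature configuration, environment preparation, single-node deployment, accuracy evaluation, and performance evaluation.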

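Supported Features#

Refer to Supported Features for the model's supported feature matrix.

Refer to the Feature Guide for how to configure each feature.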

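Environment Preparation#

Model Weights#

  • Qwen3-Omni-30B-A3B-Thinking requires 2 NPU cards (64G × 2). Download the model weights; we recommend placing them in a directory shared across nodes, such as /root/.cache/

Installation#

You can run Qwen3-Omni-30B-A3B-Thinking directly with the official Docker image.

Select an image that matches your machine type and start the container on your node; for details, refer to the Docker installation guide.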
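As a rough sanity check on the two-card requirement, the sketch below estimates the weight footprint. It assumes roughly 30 billion parameters (inferred from the "30B" in the model name), BF16 storage at 2 bytes per parameter, and an even split across cards. None of these are stated in this document, and the real footprint also includes activations and KV cache, so treat the result as a lower bound.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GiB (BF16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


# Assumption: ~30e9 parameters, inferred from the "30B" in the model name.
total_gib = weight_memory_gib(30e9)
# Assumption: weights shard evenly over the 2 cards.
per_card_gib = total_gib / 2

print(f"total ~ {total_gib:.1f} GiB, per card ~ {per_card_gib:.1f} GiB")
```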

# Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note you should download the weight to /root/.cache in advance.
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.20.2rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

Alternatively, you can build all components from source.

Install the system dependencies:

pip install qwen_omni_utils modelscope
# Used for audio processing.
apt-get update && apt-get install -y ffmpeg
# Check the installation.
ffmpeg -version

The following setting is required to avoid HcclAllreduce failures caused by the stream and shape limits of the default FFTS+ mode.

export HCCL_OP_EXPANSION_MODE="AIV"

Deployment#

Single-node Deployment#

Multi-NPU Offline Inference#

Run the following script to execute offline inference on multiple NPUs:

import gc
import torch
import os
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

os.environ["HCCL_BUFFSIZE"] = "1024"

def clean_up():
    """Clean up distributed resources and NPU memory"""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free up memory
    torch.npu.empty_cache()


def main():
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        enable_expert_parallel=True,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
        max_model_len=32768,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ]
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # 'use_audio_in_video = True' requires equal number of audio and video items, including audio from the video. 
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": True}
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
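
Because this is the Thinking variant, the generated text may contain the model's reasoning ahead of the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning ends with a `</think>` tag, as in other Qwen3 thinking models; this document does not confirm that format, and `split_thinking` is a hypothetical name.

```python
def split_thinking(generated_text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Assumption: the reasoning ends at the first `end_tag`. If the tag is
    absent, the whole text is treated as the answer.
    """
    reasoning, sep, answer = generated_text.partition(end_tag)
    if not sep:
        return "", generated_text.strip()
    # Drop an optional opening tag if the model emitted one.
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()


reasoning, answer = split_thinking(
    "<think>\nA person is drawing.\n</think>\n\nSomeone sketches on a tablet."
)
print(answer)
```

In the script above, this could be applied to `output.outputs[0].text` before printing.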

Multi-NPU Online Inference#

Run the following script to start the vLLM server on multiple NPUs. On Atlas A2, tensor-parallel-size should be at least 1 if each NPU card has 64 GB of memory, and at least 2 if each card has 32 GB.

export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_expert_parallel
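
Loading the model can take a while, so it may help to wait for the server before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` endpoint using only the standard library. The port 8000 matches the request example in this document; `wait_for_server` and its timeout values are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:8000/v1/models",
                    timeout_s: float = 600.0,
                    interval_s: float = 5.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


# Example call once `vllm serve` has been launched:
# ready = wait_for_server(timeout_s=600)
```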

Functional Verification#

Once the server is running, you can query the model with an input prompt.

curl http://localhost:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                    }
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                    }
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                    }

                },
                {
                    "type": "text",
                    "text":  "Analyze this audio, image, and video together."
                }
            ]
        }
    ]
}'
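
The same request can be sent from Python. The sketch below rebuilds the payload shown in the curl command and posts it with the standard library. `build_payload` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

BASE = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo"


def build_payload(model: str, prompt: str, image: str, audio: str, video: str) -> dict:
    """Build an OpenAI-style chat request with image, audio, and video URLs."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image}},
                {"type": "audio_url", "audio_url": {"url": audio}},
                {"type": "video_url", "video_url": {"url": video}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def chat(payload: dict, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload and return the first choice's message content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


payload = build_payload(
    "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "Analyze this audio, image, and video together.",
    image=f"{BASE}/cars.jpg",
    audio=f"{BASE}/cough.wav",
    video=f"{BASE}/draw.mp4",
)
# print(chat(payload))  # uncomment once the server is running
```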

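Accuracy Evaluation#

The accuracy evaluation method is described below.

Using EvalScope#

Taking the gsm8k, omnibench, and bbh datasets as examples, run the accuracy evaluation of Qwen3-Omni-30B-A3B-Thinking in online mode.

  1. Refer to Using evalscope (https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for how to install evalscope.

  2. Run evalscope to perform the accuracy evaluation.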

    evalscope eval \
        --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
        --api-url http://localhost:8000/v1 \
        --api-key EMPTY \
        --eval-type server \
        --datasets omni_bench gsm8k bbh \
        --dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
        --eval-batch-size 1 \
        --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
        --limit 100
    
  3. The results are available once the run completes. The following results for Qwen3-Omni-30B-A3B-Thinking on vllm-ascend:0.13.0rc1 are for reference only.

    +-----------------------------+------------+----------+----------+-------+---------+---------+
    | Model                       | Dataset    | Metric   | Subset   |   Num |   Score | Cat.0   |
    +=============================+============+==========+==========+=======+=========+=========+
    | Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default  |   100 |    0.44 | default |
    +-----------------------------+------------+----------+----------+-------+---------+---------+
    | Qwen3-Omni-30B-A3B-Thinking | gsm8k      | mean_acc | main     |   100 |    0.98 | default |
    +-----------------------------+------------+----------+----------+-------+---------+---------+
    | Qwen3-Omni-30B-A3B-Thinking | bbh        | mean_acc | OVERALL  |   270 |  0.9148 |         |
    +-----------------------------+------------+----------+----------+-------+---------+---------+
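
The JSON arguments in the evalscope command are easy to misquote in a shell. The sketch below assembles the same invocation in Python and serializes the JSON with `json.dumps`. It uses only the flags shown in step 2; `build_evalscope_cmd` is a hypothetical helper, and passing the datasets as separate arguments is an assumption about how evalscope parses `--datasets`.

```python
import json
import shlex


def build_evalscope_cmd(model_path: str, api_url: str, datasets: list[str],
                        limit: int = 100) -> list[str]:
    """Return the evalscope argument list, with JSON options serialized safely."""
    dataset_args = {
        "omni_bench": {"extra_params": {"use_image": True, "use_audio": False}}
    }
    generation_config = {"max_completion_tokens": 10000, "temperature": 0.6}
    return [
        "evalscope", "eval",
        "--model", model_path,
        "--api-url", api_url,
        "--api-key", "EMPTY",
        "--eval-type", "server",
        "--datasets", *datasets,  # assumption: one argument per dataset
        "--dataset-args", json.dumps(dataset_args),
        "--eval-batch-size", "1",
        "--generation-config", json.dumps(generation_config),
        "--limit", str(limit),
    ]


cmd = build_evalscope_cmd(
    "/root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "http://localhost:8000/v1",
    ["omni_bench", "gsm8k", "bbh"],
)
print(shlex.join(cmd))  # shell-quoted form, ready to copy
# To execute: import subprocess; subprocess.run(cmd, check=True)
```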
    

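Performance#

Using vLLM Benchmark#

Run the performance evaluation using Qwen3-Omni-30B-A3B-Thinking as an example. Refer to vLLM benchmark for more details.

There are three vllm bench subcommands:

  • latency: benchmark the latency of a single batch of requests.

  • serve: benchmark the throughput of online serving.

  • throughput: benchmark the throughput of offline inference.

Taking serve as an example, run the following commands.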

export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy

pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt

vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./

The results are available once the run completes. The following results for Qwen3-Omni-30B-A3B-Thinking on vllm-ascend:0.13.0rc1 are for reference only.

============ Serving Benchmark Result ============
Successful requests:                     200
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  211.90
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              0.94
Output token throughput (tok/s):         120.81
Peak output token throughput (tok/s):    216.00
Peak concurrent requests:                24.00
Total token throughput (tok/s):          309.58
---------------Time to First Token----------------
Mean TTFT (ms):                          215.50
Median TTFT (ms):                        211.51
P99 TTFT (ms):                           317.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          98.96
Median TPOT (ms):                        99.19
P99 TPOT (ms):                           101.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           99.02
Median ITL (ms):                         96.10
P99 ITL (ms):                            176.02
==================================================
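
The throughput figures in the report can be cross-checked from the raw counts. The sketch below recomputes them from the values printed above. The end-to-end latency estimate at the end is a derived approximation (mean TTFT plus mean TPOT for the remaining tokens) and is not a number the benchmark reports.

```python
# Values copied from the benchmark report above.
successful_requests = 200
duration_s = 211.90
input_tokens = 40000
generated_tokens = 25600
mean_ttft_ms = 215.50
mean_tpot_ms = 98.96

request_throughput = successful_requests / duration_s
output_throughput = generated_tokens / duration_s
total_throughput = (input_tokens + generated_tokens) / duration_s

# Derived estimate, not part of the report: average output length and the
# mean end-to-end latency implied by TTFT plus per-token decode time.
tokens_per_request = generated_tokens / successful_requests
est_latency_ms = mean_ttft_ms + (tokens_per_request - 1) * mean_tpot_ms

print(f"{request_throughput:.2f} req/s, "
      f"{output_throughput:.2f} output tok/s, "
      f"{total_throughput:.2f} total tok/s")
print(f"{tokens_per_request:.0f} tokens/request, "
      f"~{est_latency_ms / 1000:.1f} s per request")
```

The recomputed throughputs match the reported 0.94 req/s, 120.81 tok/s, and 309.58 tok/s, and each request generated 128 tokens on average.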