Qwen3-Omni-30B-A3B-Thinking#
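Introduction#
Qwen3-Omni is a natively end-to-end, multilingual, omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. The Thinking model of Qwen3-Omni-30B-A3B contains the thinker component, is equipped with chain-of-thought reasoning, and supports audio, video, and text input with text output.
This document walks through the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.
Supported Features#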
Refer to supported features to get the model's supported feature matrix.
Refer to feature guide to get the feature's configuration.
Environment Preparation#
Model Weight#
Qwen3-Omni-30B-A3B-Thinking requires 2 NPU cards (64 GB × 2). Download the model weights first. It is recommended to download them to a directory shared across nodes, such as /root/.cache/.
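The download step could be scripted roughly as follows. This is a minimal sketch, assuming the weights are fetched from ModelScope with `snapshot_download` and that a checkpoint directory can be recognized by its `config.json`; the helper names `has_weights` and `ensure_weights` are hypothetical and not part of vllm-ascend.

```python
from pathlib import Path

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Thinking"


def has_weights(model_dir: Path) -> bool:
    """Heuristic (assumption): a downloaded checkpoint contains config.json."""
    return (model_dir / "config.json").is_file()


def ensure_weights(model_dir: Path, cache_dir: str = "/root/.cache") -> Path:
    """Return a directory holding the weights, downloading only if needed."""
    if has_weights(model_dir):
        return model_dir
    # Imported lazily so the check above works without modelscope installed.
    from modelscope import snapshot_download

    # snapshot_download returns the local path of the downloaded snapshot.
    return Path(snapshot_download(MODEL_ID, cache_dir=cache_dir))
```

Calling `ensure_weights(Path("/path/to/checkpoint"))` before starting the container avoids a repeated download when the shared directory is already populated.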
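Installation#
You can use our official Docker image to run Qwen3-Omni-30B-A3B-Thinking directly.
Select an image based on your machine type and start the Docker image on your node; refer to Using Docker.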
# Update --device according to your device (Atlas A2: /dev/davinci[0-7]; Atlas A3: /dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note: download the weights to /root/.cache in advance.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.18.0
export NAME=vllm-ascend
# Run the container using the variables defined above.
# Note: if you run Docker with a bridge network, expose the ports needed for multi-node communication in advance.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
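When the set of NPU cards differs from the two shown above, the --device arguments can be generated instead of typed by hand. The following is an illustrative sketch only: `davinci_device_flags` is a hypothetical helper, and it assumes the same three shared device nodes as the command above.

```python
def davinci_device_flags(card_ids):
    """Build the docker --device arguments for the given NPU card indices."""
    flags = []
    for card in card_ids:
        flags += ["--device", f"/dev/davinci{card}"]
    # Device nodes shared by all cards, taken from the docker command above.
    for shared in ("davinci_manager", "devmm_svm", "hisi_hdc"):
        flags += ["--device", f"/dev/{shared}"]
    return flags


# Cards 0 and 1, matching the tensor-parallel size of 2 used in this guide.
print(" ".join(davinci_device_flags([0, 1])))
```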
Alternatively, you can build all components from source.
To install vllm-ascend, refer to Set up using Python.
Then install the required dependencies:
pip install qwen_omni_utils modelscope
# Used for audio processing.
apt-get update && apt-get install -y ffmpeg
# Check the installation.
ffmpeg -version
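The same installation check can be done from Python before launching inference. This is a small sketch using only the standard library; `check_environment` is a hypothetical name, and it only verifies that the packages are importable and that ffmpeg is on PATH, not that their versions are compatible.

```python
import importlib.util
import shutil


def check_environment(modules=("qwen_omni_utils", "modelscope"),
                      binaries=("ffmpeg",)):
    """Report which required Python modules and executables can be found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        report[name] = shutil.which(name) is not None
    return report


for item, found in check_environment().items():
    print(f"{item}: {'ok' if found else 'MISSING'}")
```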
Deployment#
Single-node Deployment#
Multi-NPU Offline Inference#
Run the following script to execute offline inference on multiple NPUs:
import gc
import os

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

os.environ["HCCL_BUFFSIZE"] = "1024"


def clean_up():
    """Clean up distributed resources and NPU memory"""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free up memory
    torch.npu.empty_cache()


def main():
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        enable_expert_parallel=True,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
        max_model_len=32768,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ]
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # 'use_audio_in_video = True' requires equal number of audio and video items, including audio from the video.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": True}
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
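Because the Thinking model emits its chain-of-thought before the final answer, it can be convenient to separate the two in the generated text. The sketch below assumes the reasoning is terminated by a `</think>` tag; this tag, and the helper name `split_thinking`, are assumptions that this guide does not confirm, so check them against the actual output of your deployment.

```python
def split_thinking(text, end_tag="</think>"):
    """Split generated text into (reasoning, answer).

    Assumption: the reasoning ends with `end_tag`. If the tag is absent,
    the whole text is treated as the answer.
    """
    head, sep, tail = text.rpartition(end_tag)
    if not sep:
        return "", text.strip()
    # The opening tag may be part of the prompt rather than the output,
    # so it is removed only if present.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>The clip shows a pen.</think>\nA hand is drawing.")
print(answer)
```

In the offline script above, the same helper could be applied to `generated_text` before printing.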
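Multi-NPU Online Inference#
Run the following script to start the vLLM server on multiple NPUs. For an Atlas A2 with 64 GB of memory per NPU card, tensor-parallel-size should be at least 1; with 32 GB of memory, tensor-parallel-size should be at least 2.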
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable-expert-parallel
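Loading the model takes a while, so requests sent right after launching the server may fail. A readiness poll along the following lines can help; it is a sketch that assumes the server listens on the default port 8000 and exposes the /health endpoint, and `wait_until_ready` and `http_ok` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(url="http://localhost:8000/health", timeout_s=600.0,
                     interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A script can then call `wait_until_ready()` and proceed to send requests only when it returns True.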
Functional Verification#
Once the server is running, you can query the model with an input prompt.
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                        }
                    },
                    {
                        "type": "audio_url",
                        "audio_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                        }
                    },
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Analyze this audio, image, and video together."
                    }
                ]
            }
        ]
    }'
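The same request can be sent from Python, which avoids shell quoting issues with the nested JSON. This is a minimal sketch using only the standard library; `build_payload` and `chat` are hypothetical helper names, and the response is assumed to follow the usual OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

DEMO_BASE = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo"


def build_payload(model="Qwen/Qwen3-Omni-30B-A3B-Thinking"):
    """Build the same multimodal chat request as the curl command above."""
    def media(kind, name):
        # kind is "image", "audio", or "video".
        return {"type": f"{kind}_url", f"{kind}_url": {"url": f"{DEMO_BASE}/{name}"}}

    content = [
        media("image", "cars.jpg"),
        media("audio", "cough.wav"),
        media("video", "draw.mp4"),
        {"type": "text", "text": "Analyze this audio, image, and video together."},
    ]
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(payload, url="http://localhost:8000/v1/chat/completions", timeout=600):
    """POST the payload and return the text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat(build_payload()))` prints the model's reply.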
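Accuracy Evaluation#
The following describes how to evaluate accuracy.
Using EvalScope#
As an example, the gsm8k, omnibench, and bbh datasets are used to run an accuracy evaluation of Qwen3-Omni-30B-A3B-Thinking in online mode.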
For evalscope installation, refer to Using evalscope (https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip).
Run evalscope to execute the accuracy evaluation:
evalscope eval \
    --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
    --api-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --eval-type server \
    --datasets omni_bench gsm8k bbh \
    --dataset-args '{"omni_bench": {"extra_params": {"use_image": true, "use_audio": false}}}' \
    --eval-batch-size 1 \
    --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
    --limit 100
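The JSON-valued options are easy to get wrong when quoted by hand. One way to avoid this is to assemble the command in Python and let `json.dumps` and `shlex.join` handle the quoting. This is an illustrative sketch: `evalscope_command` is a hypothetical helper, it uses only the flags that appear in the command above, and it assumes `--datasets` takes space-separated names.

```python
import json
import shlex


def evalscope_command(model, datasets, dataset_args, generation_config,
                      api_url="http://localhost:8000/v1", limit=100):
    """Assemble the evalscope CLI invocation as an argument list."""
    return [
        "evalscope", "eval",
        "--model", model,
        "--api-url", api_url,
        "--api-key", "EMPTY",
        "--eval-type", "server",
        "--datasets", *datasets,
        "--dataset-args", json.dumps(dataset_args),
        "--eval-batch-size", "1",
        "--generation-config", json.dumps(generation_config),
        "--limit", str(limit),
    ]


cmd = evalscope_command(
    model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking",
    datasets=["omni_bench", "gsm8k", "bbh"],
    dataset_args={"omni_bench": {"extra_params": {"use_image": True, "use_audio": False}}},
    generation_config={"max_completion_tokens": 10000, "temperature": 0.6},
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list can also be passed directly to `subprocess.run(cmd, check=True)` without going through a shell.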
After execution, you can get the results. The following are the results of Qwen3-Omni-30B-A3B-Thinking on vllm-ascend:0.13.0rc1, for reference only.
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model                       | Dataset    | Metric   | Subset   |   Num |   Score | Cat.0   |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default  |   100 |    0.44 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k      | mean_acc | main     |   100 |    0.98 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh        | mean_acc | OVERALL  |   270 |  0.9148 |         |
+-----------------------------+------------+----------+----------+-------+---------+---------+
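Performance#
Using vLLM Benchmark#
Take the performance evaluation of Qwen3-Omni-30B-A3B-Thinking as an example. For more details, refer to vllm benchmark.
vllm bench has three subcommands:
latency: benchmarks the latency of a single batch of requests.
serve: benchmarks online serving throughput.
throughput: benchmarks offline inference throughput.
Take serve as an example. Run the code as follows.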
export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt
vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
After execution, you can get the results. The following are the results of Qwen3-Omni-30B-A3B-Thinking on vllm-ascend:0.13.0rc1, for reference only.
============ Serving Benchmark Result ============
Successful requests: 200
Failed requests: 0
Request rate configured (RPS): 1.00
Benchmark duration (s): 211.90
Total input tokens: 40000
Total generated tokens: 25600
Request throughput (req/s): 0.94
Output token throughput (tok/s): 120.81
Peak output token throughput (tok/s): 216.00
Peak concurrent requests: 24.00
Total token throughput (tok/s): 309.58
---------------Time to First Token----------------
Mean TTFT (ms): 215.50
Median TTFT (ms): 211.51
P99 TTFT (ms): 317.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 98.96
Median TPOT (ms): 99.19
P99 TPOT (ms): 101.52
---------------Inter-token Latency----------------
Mean ITL (ms): 99.02
Median ITL (ms): 96.10
P99 ITL (ms): 176.02
==================================================
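The throughput figures in the report follow from the token counts and the benchmark duration, which gives a quick way to sanity-check a run. The short worked example below recomputes them from the numbers printed above; the variable names are illustrative only.

```python
# Values copied from the benchmark report above.
duration_s = 211.90
successful_requests = 200
total_input_tokens = 40000
total_generated_tokens = 25600

# Throughput = amount of work / wall-clock duration of the benchmark.
request_throughput = successful_requests / duration_s
output_token_throughput = total_generated_tokens / duration_s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.2f}")
```

The input total also matches the command line: 200 prompts with 200 random input tokens each gives 40000 tokens.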