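Qwen-VL-Dense (Qwen3-VL-2B/4B/8B/32B)#
Introduction#
Alibaba Cloud's Qwen-VL (Tongyi Qianwen vision-language) series is a family of large vision-language models (LVLMs) designed for comprehensive multimodal understanding. The models accept images, text, and bounding boxes as input and produce text and detection boxes as output, supporting advanced capabilities such as image detection, multimodal dialogue, and multi-image reasoning.
This document walks through the main validation steps for the model: supported features, feature configuration, environment preparation, NPU deployment, accuracy evaluation, and performance evaluation.
This tutorial uses vLLM-Ascend v0.11.0rc3-a3 for demonstration, with Qwen3-VL-8B-Instruct as the single-NPU example and Qwen3-VL-32B-Instruct as the multi-NPU example.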
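Supported Features#
See the supported features matrix for the list of features this model supports.
See the feature guide for how to configure each feature.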
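Environment Preparation#
Model Weights#
The following models require 1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node:
Qwen3-VL-2B-Instruct: Download model weights
Qwen3-VL-4B-Instruct: Download model weights
Qwen3-VL-8B-Instruct: Download model weights
Qwen3-VL-32B-Instruct: Download model weights
We recommend downloading the model weights to a directory shared across nodes, such as /root/.cache/.
Installation#
Run the Docker container (single NPU):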
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
Run the Docker container (multi-NPU, here with two cards mapped and host networking):
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data:/data \
-it $IMAGE bash
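The two docker run commands above differ mainly in how many /dev/davinciN devices they map into the container. As a rough sketch, the --device arguments for any set of NPU IDs could be generated as follows. The helper name build_device_flags is hypothetical and is not part of Docker or vLLM-Ascend; it only mirrors the device list used in the commands above.

```python
def build_device_flags(npu_ids):
    """Return the `--device` arguments for the given NPU card IDs.

    Hypothetical helper for illustration: one /dev/davinciN entry per
    card, followed by the three shared management devices that both
    docker commands above map.
    """
    flags = []
    for npu_id in npu_ids:
        flags += ["--device", f"/dev/davinci{npu_id}"]
    for shared in ("davinci_manager", "devmm_svm", "hisi_hdc"):
        flags += ["--device", f"/dev/{shared}"]
    return flags


# Two cards, as in the multi-NPU command
print(" ".join(build_device_flags([0, 1])))
```

The resulting string can be pasted into the docker run command in place of the hand-written --device lines.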
Set the environment variables:
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
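Note
max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. For more details, see here.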
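When the model is driven from a Python script rather than a shell, the same two variables can be set in-process. The sketch below assumes they need to be in place before torch or vllm is imported, which is the conservative choice; the helper alloc_conf is a hypothetical name used only for illustration.

```python
import os


def alloc_conf(max_split_size_mb):
    """Format the allocator option string used in the export above."""
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    return f"max_split_size_mb:{max_split_size_mb}"


# Set both variables before importing torch / vllm (assumed requirement)
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["PYTORCH_NPU_ALLOC_CONF"] = alloc_conf(256)
print(os.environ["PYTORCH_NPU_ALLOC_CONF"])
```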
Deployment#
Offline Inference#
Run the following script to perform offline inference on a single NPU:
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen3-VL-8B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

# Render the chat template to a plain prompt string
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Load the images referenced in the messages; video outputs are unused here
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
If the script runs successfully, you will see output similar to the following:
**Visual Components:**
1. **Abstract Geometric Icon (Left Side):**
* The logo features a stylized, abstract icon on the left.
* It is composed of interconnected lines and angular shapes, forming a complex, hexagonal-like structure.
* The icon is rendered in a solid, thin blue line, giving it a modern, technological, and clean appearance.
2. **Text (Right Side):**
* To the right of the icon, the name "TONGYI Qwen" is written.
* **"TONGYI"** is written in uppercase letters in a bold, modern sans-serif font. The color is a medium blue, matching the icon's color.
* **"Qwen"** is written below "TONGYI" in a slightly larger, bold, sans-serif font. The color of "Qwen" is a dark gray or black, creating a strong contrast with the blue text above it.
* The text is aligned and spaced neatly, with "Qwen" appearing slightly larger and bolder than "TONGYI," emphasizing the proper noun.
**Overall Design and Aesthetics:**
* The logo has a clean, contemporary, and professional feel, suitable for a technology and AI product.
* The use of blue conveys trust, innovation, and intelligence, while the dark gray adds stability and clarity.
* The overall layout is balanced and symmetrical, with the icon and text arranged horizontally for easy recognition and memorability.
* The design effectively communicates the product's high-tech nature while remaining brand-identifiable and straightforward.
The logo is designed to be easily recognizable across various media and scales, from digital screens to printed materials.
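The min_pixels and max_pixels fields in the image message above bound the total pixel count of the image before it is encoded. The sketch below only works through that arithmetic: it computes the two bounds and the uniform scale factor that would bring an image into range. The function scale_to_budget is a hypothetical illustration, not the resizing routine used by qwen_vl_utils, which may also round each side to a multiple of the patch size.

```python
import math

MIN_PIXELS = 224 * 224        # lower bound from the script above
MAX_PIXELS = 1280 * 28 * 28   # upper bound from the script above


def scale_to_budget(width, height, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return a uniform scale factor that brings width*height into the budget.

    Illustrative only: rounding to patch-size multiples is not modeled.
    """
    pixels = width * height
    if pixels > max_pixels:
        return math.sqrt(max_pixels / pixels)
    if pixels < min_pixels:
        return math.sqrt(min_pixels / pixels)
    return 1.0


print(MIN_PIXELS, MAX_PIXELS)
print(scale_to_budget(1024, 768))    # already within the budget
print(scale_to_budget(4000, 3000))   # too large, so the factor is below 1
```

Lowering max_pixels trades image detail for a shorter visual prompt, which can help when max_model_len is tight.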
Run the following script to perform offline inference on multiple NPUs:
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen3-VL-32B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=2,  # shard the model across two NPUs
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

# SamplingParams takes max_tokens (not max_completion_tokens)
sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

# Render the chat template to a plain prompt string
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Load the images referenced in the messages; video outputs are unused here
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
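If the script runs successfully, you will see output similar to the following (the text is cut off where the 512-token generation limit is reached):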
The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:
### **1. Logo:**
- The logo on the left side of the image consists of a stylized, abstract geometric design.
- The logo is primarily composed of interconnected lines and shapes that resemble a combination of arrows, lines, and geometric forms.
- The lines are arranged in a triangular pattern, giving it a dynamic and modern appearance.
- The lines are rendered in a dark blue color, and they form a three-dimensional, arrow-like structure. This conveys a sense of movement, forward momentum, or direction, which is often symbolic of progress and integration.
- The design appears to be complex yet minimalistic, with clean and sharp lines.
- The triangular and square-like structure suggests precision, connectivity, and innovation, which are often associated with technology and advanced systems.
- This abstract, arrow-like design implies a sense of flow, direction, and connectivity, which aligns with themes of progress and technological advancement.
### **2. Text:**
- **"TONGYI" (on the top right side):
- The text is in dark blue, which is a color often associated with technology, stability, and trustworthiness.
- The name "Tongyi" is written in a bold, sans-serif font, giving it a modern and professional look.
- **"Qwen" (below "Tongyi"):
- The font for "Qwen" is in a bold, uppercase format.
- The style
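Both offline scripts set limit_mm_per_prompt={"image": 10}, which caps the number of images in a single prompt at 10. For multi-image prompts, a small helper along the following lines could build the user message in the same layout and reject requests over the cap before they reach the engine. The name build_image_message and its max_images parameter are hypothetical.

```python
def build_image_message(image_urls, question, max_images=10):
    """Build one user message holding several images plus a text question.

    Hypothetical helper: it follows the message layout of the scripts
    above and rejects prompts that exceed the per-prompt image cap.
    """
    if len(image_urls) > max_images:
        raise ValueError(
            f"{len(image_urls)} images exceed the limit of {max_images} per prompt"
        )
    content = [{"type": "image", "image": url} for url in image_urls]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}


url = "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"
message = build_image_message([url, url], "What do these two images have in common?")
print(len(message["content"]))  # 2 image entries + 1 text entry
```

The returned dictionary can replace the user entry in image_messages; the rest of the script stays unchanged.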
Online Serving#
Run the following command in the Docker container to start the vLLM server on a single NPU:
vllm serve Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384
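Note
The --max_model_len option is added to avoid the ValueError raised when the maximum sequence length of the Qwen3-VL-8B-Instruct model (256000) exceeds the maximum number of tokens the KV cache can hold. That capacity depends on the on-chip memory size and therefore differs between NPU series; adjust the value to suit your NPU series.
If the service starts successfully, you will see output similar to the following: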
INFO: Started server process [2736]
INFO: Waiting for application startup.
INFO: Application startup complete.
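The note on --max_model_len comes down to a capacity check: the requested context length must fit in the memory left over for the KV cache. A back-of-the-envelope sketch of that check follows. All numeric values (layer count, KV heads, head dimension, free memory) are assumed example values, not read from the Qwen3-VL-8B-Instruct config, and vLLM's real accounting also involves block sizes and memory-utilization settings that are not modeled here.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes, bytes_per_token):
    """How many tokens fit in the memory left for the KV cache."""
    return free_bytes // bytes_per_token


# ASSUMED example values, not taken from the actual model config:
# 36 layers, 8 KV heads, head_dim 128, bfloat16 (2 bytes), 20 GiB free.
per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128)
capacity = max_cached_tokens(20 * 1024**3, per_token)
print(per_token, capacity)
print(capacity >= 16384)  # would a 16384-token context fit under these assumptions?
```

To run the estimate for a real deployment, substitute the values from the model's config.json and the free memory reported for your NPU.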
Once the server is running, you can query the model with an input prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
]
}'
If the query succeeds, you will see output similar to the following (client):
{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
vLLM server log:
INFO 12-05 08:42:07 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct
INFO 12-05 08:42:11 [acl_graph.py:187] Replaying aclgraph
INFO: 127.0.0.1:60988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:42:13 [loggers.py:127] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:42:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
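The curl request above can also be issued from Python. The following sketch uses only the standard library; the function names are hypothetical, and the base URL assumes the server from this section is listening on localhost:8000. The request is only built here, so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_payload(model, image_url, question):
    """Build the same JSON body as the curl command above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    }


def query_server(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_answer(response):
    """Return the assistant text and the usage counters from a response."""
    return response["choices"][0]["message"]["content"], response["usage"]


payload = build_chat_payload(
    "Qwen/Qwen3-VL-8B-Instruct",
    "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
    "What is the text in the illustration?",
)
# With a server running: answer, usage = extract_answer(query_server(payload))
```

Applied to the client response shown above, extract_answer would return the assistant's reply text together with the prompt_tokens, completion_tokens, and total_tokens counters.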
Run the following script in the Docker container to start the vLLM server on multiple NPUs:
#!/bin/sh
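# Install jemalloc with whichever package manager the OS provides
if command -v apt >/dev/null 2>&1; then
    # Ubuntu
    apt update
    apt install -y libjemalloc2
elif command -v yum >/dev/null 2>&1; then
    # openEuler
    yum update -y
    yum install -y jemalloc
fi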
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
# On Ubuntu, first install with `apt install libjemalloc2`
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
# On openEuler, first install with `yum install jemalloc`
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
# Enable the AIVector core to directly schedule ROCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Set vLLM to Engine V1
export VLLM_USE_V1=1
vllm serve Qwen/Qwen3-VL-32B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 30000 \
--max-num-batched-tokens 50000 \
--max-num-seqs 30 \
--no-enable-prefix-caching \
--trust-remote-code \
--dtype bfloat16
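Note
The --max-model-len option is added to avoid the ValueError raised when the max_model_len of the Qwen3-VL-32B-Instruct model (128000) exceeds the maximum number of tokens the KV cache can hold. That capacity depends on the on-chip memory size and therefore differs between NPU series; adjust the value to suit your NPU series.
If the service starts successfully, you will see output similar to the following: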
INFO: Started server process [14431]
INFO: Waiting for application startup.
INFO: Application startup complete.
Once the server is running, you can query the model with an input prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-32B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
]
}'
If the query succeeds, you will see output similar to the following (client):
{"id":"chatcmpl-c07088bf992a4b77a89d79480122a483","object":"chat.completion","created":1764905884,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is:\n\n**TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":89,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
vLLM server log:
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
INFO 12-05 08:50:57 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-32B-Instruct
2025-12-05 08:50:58,913 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 12-05 08:51:00 [acl_graph.py:187] Replaying aclgraph
INFO: 127.0.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:51:10 [loggers.py:127] Engine 000: Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
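Accuracy Evaluation#
Using Language Model Evaluation Harness#
The accuracy of the following models is monitored in our CI:
Qwen3-VL-8B-Instruct
Taking the mmmu_val dataset as an example, run the accuracy evaluation of Qwen3-VL-8B-Instruct in offline mode.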
For more details on installing lm_eval, see Using lm_eval for accuracy evaluation.
pip install lm_eval
Run lm_eval to execute the accuracy evaluation.
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
--tasks mmmu_val \
--batch_size 32 \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
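The results are available once the run finishes. The following are the results of Qwen3-VL-8B-Instruct on vllm-ascend:0.11.0rc3, for reference only.
| Task     | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|----------|---------|--------|--------|--------|---|--------|---|--------|
| mmmu_val | 0       | none   |        | acc    | ↑ | 0.5389 | ± | 0.0159 |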
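When comparing a local run against the reference result above, the standard error gives a sense of how much deviation is unremarkable. A minimal sketch, assuming a normal approximation (z = 1.96 for a 95% interval), computes the range implied by the reported value and standard error; the function name is hypothetical.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate 95% interval under a normal assumption: value ± z * stderr."""
    margin = z * stderr
    return value - margin, value + margin


# Reference result from the table above: 0.5389 ± 0.0159
low, high = confidence_interval(0.5389, 0.0159)
print(f"{low:.4f} to {high:.4f}")
```

A local score inside this range (roughly 0.508 to 0.570) is consistent with the reference run; a score well outside it is worth investigating.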
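Performance#
Using vLLM Benchmark#
See vLLM benchmark for more details.
vllm bench has three subcommands:
latency: Benchmark the latency of a single batch of requests.
serve: Benchmark the throughput of online serving.
throughput: Benchmark the throughput of offline inference.
Performance evaluation must be run in online mode, so start the server first. Taking serve as an example, run the following command.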
vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
After a few minutes, the performance evaluation results will be available.
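The "few minutes" follows from the command's own parameters: with --num-prompts 200 and --request-rate 1, issuing the requests alone takes on the order of 200 seconds. The sketch below is only this arithmetic; the function name is hypothetical, and the real run takes somewhat longer because request arrivals are randomized and the last requests still need to finish.

```python
def min_send_duration_s(num_prompts, request_rate):
    """Approximate time to issue all requests at the given rate (requests/s)."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    return num_prompts / request_rate


seconds = min_send_duration_s(num_prompts=200, request_rate=1)
print(f"{seconds:.0f} s (about {seconds / 60:.1f} min)")
```

Raising --request-rate shortens the run but also increases load on the server, so results at different rates are not directly comparable.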