Qwen-VL-Dense (Qwen3-VL-2B/4B/8B/32B)#

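Introduction#

Alibaba Cloud's Tongyi Qianwen vision-language (Qwen-VL) series is a family of powerful large vision-language models (LVLMs) designed for comprehensive multimodal understanding. The models accept images, text, and bounding boxes as input and produce text and detection boxes as output, supporting advanced capabilities such as image detection, multimodal dialogue, and multi-image reasoning.

This document walks through the main verification steps for the models: supported features, feature configuration, environment preparation, NPU deployment, accuracy evaluation, and performance evaluation.

This tutorial uses vLLM-Ascend v0.11.0rc3-a3 for demonstration, with Qwen3-VL-8B-Instruct as the single-NPU example and Qwen3-VL-32B-Instruct as the multi-NPU example.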

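Supported Features#

Refer to the supported features matrix for the list of features the model supports.

Refer to the feature guide for how to configure each feature.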

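Environment Preparation#

Model Weights#

One Atlas 800I A2 (64G × 8) node or one Atlas 800 A3 (64G × 16) node is required.

It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/

Installation#

Run the Docker container (single NPU):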

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash

Run the Docker container (multiple NPUs, here two devices):

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data:/data \
-it $IMAGE bash

Set the environment variables:

# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

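Note

max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. See the linked reference for more details.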
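
When the model is launched from a Python script rather than a shell, the same setting can be applied through `os.environ`. A minimal sketch, assuming the variable is read when the allocator initializes, so it should be set before `vllm` or `torch` is imported; `set_alloc_conf` is a hypothetical helper name, and only the variable name and the value 256 come from this guide:

```python
import os


def set_alloc_conf(max_split_size_mb: int = 256) -> str:
    """Write PYTORCH_NPU_ALLOC_CONF and return the value that was set.

    Hypothetical helper for illustration; not part of vLLM or torch_npu.
    """
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be a positive integer")
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_NPU_ALLOC_CONF"] = value
    return value


# Call this at the top of the script, before importing vllm or torch.
conf = set_alloc_conf(256)
print(conf)
```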

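Deployment#

Offline Inference#

To run offline inference on a single NPU, first install the helper package from the shell:

pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/

Then run the following Python script:
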
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen3-VL-8B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

If the script runs successfully, you will see output similar to the following:

**Visual Components:**

1.  **Abstract Geometric Icon (Left Side):**
    *   The logo features a stylized, abstract icon on the left.
    *   It is composed of interconnected lines and angular shapes, forming a complex, hexagonal-like structure.
    *   The icon is rendered in a solid, thin blue line, giving it a modern, technological, and clean appearance.

2.  **Text (Right Side):**
    *   To the right of the icon, the name "TONGYI Qwen" is written.
    *   **"TONGYI"** is written in uppercase letters in a bold, modern sans-serif font. The color is a medium blue, matching the icon's color.
    *   **"Qwen"** is written below "TONGYI" in a slightly larger, bold, sans-serif font. The color of "Qwen" is a dark gray or black, creating a strong contrast with the blue text above it.
    *   The text is aligned and spaced neatly, with "Qwen" appearing slightly larger and bolder than "TONGYI," emphasizing the proper noun.

**Overall Design and Aesthetics:**

*   The logo has a clean, contemporary, and professional feel, suitable for a technology and AI product.
*   The use of blue conveys trust, innovation, and intelligence, while the dark gray adds stability and clarity.
*   The overall layout is balanced and symmetrical, with the icon and text arranged horizontally for easy recognition and memorability.
*   The design effectively communicates the product's high-tech nature while remaining brand-identifiable and straightforward.

The logo is designed to be easily recognizable across various media and scales, from digital screens to printed materials.
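
The script above sets `limit_mm_per_prompt={"image": 10}`, so one prompt may carry up to ten images. A minimal sketch of building such a message list in the same format as `image_messages`; `build_image_messages` is a hypothetical helper, and the second URL is a placeholder rather than a real resource:

```python
from typing import Any

MAX_IMAGES = 10  # mirrors limit_mm_per_prompt={"image": 10} in the script


def build_image_messages(image_urls: list[str], question: str) -> list[dict[str, Any]]:
    """Build a chat message list with several images followed by one question."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    if len(image_urls) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images are allowed per prompt")
    content: list[dict[str, Any]] = [
        {
            "type": "image",
            "image": url,
            "min_pixels": 224 * 224,
            "max_pixels": 1280 * 28 * 28,
        }
        for url in image_urls
    ]
    content.append({"type": "text", "text": question})
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": content},
    ]


messages = build_image_messages(
    [
        "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
        "https://example.com/second.png",  # placeholder URL, assumption
    ],
    "Compare these two images.",
)
print(len(messages[1]["content"]))
```

The returned list can take the place of `image_messages` in the script; the rest of the pipeline stays the same.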

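To run offline inference on multiple NPUs, first install the helper package from the shell:

pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/

Then run the following Python script:
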
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen3-VL-32B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=2,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

If the script runs successfully, you will see output similar to the following (truncated at the 512-token limit):

The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:

### **1. Logo:**
- The logo on the left side of the image consists of a stylized, abstract geometric design.
- The logo is primarily composed of interconnected lines and shapes that resemble a combination of arrows, lines, and geometric forms.
- The lines are arranged in a triangular pattern, giving it a dynamic and modern appearance.
- The lines are rendered in a dark blue color, and they form a three-dimensional, arrow-like structure. This conveys a sense of movement, forward momentum, or direction, which is often symbolic of progress and integration.
- The design appears to be complex yet minimalistic, with clean and sharp lines.
- The triangular and square-like structure suggests precision, connectivity, and innovation, which are often associated with technology and advanced systems.
- This abstract, arrow-like design implies a sense of flow, direction, and connectivity, which aligns with themes of progress and technological advancement.

### **2. Text:**
- **"TONGYI" (on the top right side):
  - The text is in dark blue, which is a color often associated with technology, stability, and trustworthiness.
  - The name "Tongyi" is written in a bold, sans-serif font, giving it a modern and professional look.
- **"Qwen" (below "Tongyi"):
  - The font for "Qwen" is in a bold, uppercase format.
  - The style
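
One likely reason the multi-NPU script uses `tensor_parallel_size=2` is that the 32B weights alone approach the memory of a single 64G device. A back-of-the-envelope sketch, assuming roughly 32 billion parameters stored in bfloat16 (2 bytes each) and weights split evenly across devices; activations, the vision encoder, and the KV cache need additional room that is not counted here:

```python
GIB = 1024 ** 3


def weight_gib_per_device(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate weight memory per device under even tensor-parallel sharding."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    return num_params * bytes_per_param / tp_size / GIB


# 32e9 parameters is an assumption read off the "32B" model name.
single = weight_gib_per_device(32e9, 2, 1)
sharded = weight_gib_per_device(32e9, 2, 2)
print(f"TP=1: {single:.1f} GiB per device")  # TP=1: 59.6 GiB per device
print(f"TP=2: {sharded:.1f} GiB per device")  # TP=2: 29.8 GiB per device
```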

Online Serving#

Run the following command inside the Docker container to start the vLLM server on a single NPU:

vllm serve Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--max-num-batched-tokens 16384

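Note

Add the --max-model-len option to avoid the ValueError raised when the Qwen3-VL-8B-Instruct model's maximum sequence length (256000) exceeds the maximum number of tokens the KV cache can store. The limit differs across NPU series depending on on-chip memory size, so adjust the value to suit your NPU series.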
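
To see why a long context can exceed the KV cache, it helps to estimate the cache size per token. A rough sketch, assuming a standard grouped-query-attention layout with a bfloat16 cache; the layer count, KV head count, and head dimension below are illustrative placeholders, not values taken from this guide, and should be read from the model's `config.json`:

```python
GIB = 1024 ** 3

# Placeholder architecture values (assumptions for illustration only).
ASSUMED_CONFIG = {"num_layers": 36, "num_kv_heads": 8, "head_dim": 128}


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(seq_len: int) -> float:
    """KV cache size in GiB for one sequence of seq_len tokens."""
    return kv_cache_bytes_per_token(**ASSUMED_CONFIG) * seq_len / GIB


for seq_len in (16384, 256000):
    print(f"{seq_len} tokens: {kv_cache_gib(seq_len):.2f} GiB")
```

Under these assumptions a single 256000-token sequence would need about 35 GiB of KV cache on top of the weights, while 16384 tokens needs 2.25 GiB, which illustrates why capping the context length lets the server start.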

If the service starts successfully, you will see output similar to the following:

INFO:     Started server process [2736]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Once the server is running, you can query the model with an input prompt:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'

If the query succeeds, the client receives a response similar to the following:

{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

vLLM server logs:

INFO 12-05 08:42:07 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct
INFO 12-05 08:42:11 [acl_graph.py:187] Replaying aclgraph
INFO:     127.0.0.1:60988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:42:13 [loggers.py:127] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:42:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
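
The same request as the curl command above can be sent from Python. A minimal sketch using only the standard library, assuming the server is listening on `localhost:8000`; `build_payload`, `extract_answer`, and `query` are hypothetical helper names, and `query` is defined but not called here because it needs a running server:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen3-VL-8B-Instruct"
IMAGE_URL = "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"


def build_payload(model: str, image_url: str, question: str) -> dict:
    """Build the chat-completions request body used in the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
    }


def extract_answer(response: dict) -> str:
    """Return the assistant text from a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def query(payload: dict, url: str = API_URL, timeout: float = 120.0) -> dict:
    """POST the payload to the server and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


payload = build_payload(MODEL, IMAGE_URL, "What is the text in the illustration?")
```

With the server running, `extract_answer(query(payload))` should return the assistant text shown in the client response above.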

Run the following script inside the Docker container to start the vLLM server on multiple NPUs:

#!/bin/sh
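# Install jemalloc with the package manager available on this OS
if command -v apt >/dev/null 2>&1; then
    # Ubuntu
    apt update
    apt install -y libjemalloc2
elif command -v yum >/dev/null 2>&1; then
    # openEuler
    yum update -y
    yum install -y jemalloc
fi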
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
    # On Ubuntu, first install with `apt install libjemalloc2`
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
    # On openEuler, first install with `yum install jemalloc`
    export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
# Enable the AIVector core to directly schedule ROCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Set vLLM to Engine V1
export VLLM_USE_V1=1

vllm serve Qwen/Qwen3-VL-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 30000 \
    --max-num-batched-tokens 50000 \
    --max-num-seqs 30 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --dtype bfloat16

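Note

Add the --max-model-len option to avoid the ValueError raised when the Qwen3-VL-32B-Instruct model's max_model_len (128000) exceeds the maximum number of tokens the KV cache can store. The limit differs across NPU series depending on on-chip memory size, so adjust the value to suit your NPU series.

If the service starts successfully, you will see output similar to the following: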

INFO:     Started server process [14431]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Once the server is running, you can query the model with an input prompt:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-VL-32B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'

If the query succeeds, the client receives a response similar to the following:

{"id":"chatcmpl-c07088bf992a4b77a89d79480122a483","object":"chat.completion","created":1764905884,"model":"Qwen/Qwen3-VL-32B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is:\n\n**TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":89,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

vLLM server logs:

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
INFO 12-05 08:50:57 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-32B-Instruct
2025-12-05 08:50:58,913 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 12-05 08:51:00 [acl_graph.py:187] Replaying aclgraph
INFO:     127.0.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:51:10 [loggers.py:127] Engine 000: Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Accuracy Evaluation#

Using Language Model Evaluation Harness#

The accuracy of the following models is monitored in our CI:

  • Qwen3-VL-8B-Instruct

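Taking the mmmu_val dataset as an example, run the accuracy evaluation for Qwen3-VL-8B-Instruct in offline mode.

  1. Install lm_eval. For more details on installation, refer to Using lm_eval for accuracy evaluation.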

    pip install lm_eval
    
  2. Run lm_eval to perform the accuracy evaluation.

    lm_eval \
        --model vllm-vlm \
        --model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
        --tasks mmmu_val \
        --batch_size 32 \
        --apply_chat_template \
        --trust_remote_code \
        --output_path ./results
    
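  3. The results are available once the run completes. The results below are for Qwen3-VL-8B-Instruct on vllm-ascend:0.11.0rc3 and are for reference only.

    | Task     | Version | Filter | n-shot | Metric | Value  |   | Stderr |
    |----------|---------|--------|--------|--------|--------|---|--------|
    | mmmu_val | 0       |        |        | acc    | 0.5389 | ± | 0.0159 |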
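
The value after ± is the standard error of the metric. A short worked example, assuming a normal approximation, turns the reported numbers into an approximate 95% confidence interval; `confidence_interval` is a hypothetical helper name:

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) bounds of value ± z * stderr."""
    margin = z * stderr
    return value - margin, value + margin


# Reported mmmu_val accuracy and standard error from the results above.
low, high = confidence_interval(0.5389, 0.0159)
print(f"{low:.4f} to {high:.4f}")  # 0.5077 to 0.5701
```

A rerun whose accuracy falls inside this range can reasonably be treated as consistent with the reference result.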

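Performance#

Using vLLM Benchmark#

For more details, refer to the vLLM benchmark documentation.

vllm bench includes the following three subcommands:

  • latency: benchmarks the latency of a single batch of requests.

  • serve: benchmarks the throughput of online serving.

  • throughput: benchmarks the throughput of offline inference.

Performance evaluation must be carried out in online mode, with the vLLM server already running. Taking serve as an example, run the following command.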

vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./

After a few minutes, the performance evaluation results are available.
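
The duration follows from the flags: 200 prompts issued at 1 request per second take about 200 seconds to send, plus the time to finish the last responses. A small sketch of that arithmetic, assuming requests are issued at the average rate given by --request-rate (individual arrival times may be randomized by the benchmark); `send_window_seconds` is a hypothetical helper name:

```python
def send_window_seconds(num_prompts: int, request_rate: float) -> float:
    """Expected time needed to issue all requests at the given average rate."""
    if num_prompts < 0:
        raise ValueError("num_prompts must not be negative")
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    return num_prompts / request_rate


window = send_window_seconds(200, 1.0)
print(f"{window:.0f} s, about {window / 60:.1f} min")  # 200 s, about 3.3 min
```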