Qwen3-VL-30B-A3B-Instruct

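Introduction

Alibaba Cloud's Qwen-VL (Vision-Language) series is a family of large vision-language models (LVLMs) designed for comprehensive multimodal understanding. The models accept images, text, and bounding boxes as input and produce text and detection boxes as output, which enables image detection, multimodal dialogue, and multi-image reasoning.

This document walks through the main verification steps for Qwen3-VL-30B-A3B-Instruct.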

Supported Features

Environment Preparation

Prepare Model Weights

Running this model requires one Atlas 800I A2 (64G × 8) node or one Atlas 800 A3 (64G × 16) node.

Download the model weights from the ModelScope website, or use the following commands:

pip install modelscope
modelscope download --model Qwen/Qwen3-VL-30B-A3B-Instruct

It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/
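Before starting a container on a node, it can help to confirm that the shared directory already holds the weights. The sketch below is a minimal check, not part of the official tooling. It assumes the layout `<cache_root>/modelscope/hub/models/<model_id>`, which is inferred from the `model` path in the sample server response later in this guide, and it assumes a `config.json` file marks a complete download. Both function names are hypothetical, and the layout may differ across ModelScope versions.

```python
from pathlib import Path

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"


def expected_weights_dir(cache_root: str, model_id: str = MODEL_ID) -> Path:
    """Return the directory where the weights are assumed to land.

    Assumption: the <cache_root>/modelscope/hub/models/<model_id> layout is
    inferred from the sample server response in this guide.
    """
    return Path(cache_root) / "modelscope" / "hub" / "models" / model_id


def weights_present(cache_root: str, model_id: str = MODEL_ID) -> bool:
    """Return True if the assumed directory contains a config.json file."""
    return (expected_weights_dir(cache_root, model_id) / "config.json").is_file()


if __name__ == "__main__":
    print(expected_weights_dir("/root/.cache"))
    print("weights present:", weights_present("/root/.cache"))
```

If the check reports that the weights are missing, run the `modelscope download` command shown above.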

Installation

Run the Docker container:

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data:/data \
-v <path/to/your/media>:/media \
-it $IMAGE bash
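The command above maps two NPUs (davinci0 and davinci1) plus three management devices. When more NPUs are needed, writing the `--device` flags by hand gets repetitive. The following sketch generates them. The function name `device_flags` is hypothetical, and it assumes the NPUs are numbered contiguously from 0 on the host, which may not hold on every machine.

```python
import shlex


def device_flags(num_npus: int) -> list[str]:
    """Build `--device` flags for NPUs 0..num_npus-1 plus management devices."""
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    flags: list[str] = []
    # Assumption: NPU device nodes are numbered contiguously starting at 0.
    for idx in range(num_npus):
        flags += ["--device", f"/dev/davinci{idx}"]
    # Management devices that the docker command in this guide always maps.
    for dev in ("/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"):
        flags += ["--device", dev]
    return flags


if __name__ == "__main__":
    # Prints the flags as one shell-quoted string for pasting into docker run.
    print(shlex.join(device_flags(2)))
```

The number of NPUs mapped into the container should be at least the value passed to --tensor-parallel-size when the server is launched.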

Set the environment variables:

# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

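Note

max_split_size_mb prevents the native allocator from splitting memory blocks larger than this size (in MB). This reduces memory fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details here.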
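When the server is launched from a Python wrapper rather than an interactive shell, the same two variables can be set programmatically. The sketch below only mirrors the `export` lines above. The helper names are hypothetical, and it assumes the variables are read when the libraries initialize, so they should be set before importing vllm or torch.

```python
import os
from collections.abc import MutableMapping
from typing import Optional


def format_alloc_conf(max_split_size_mb: int) -> str:
    """Format the allocator setting as `max_split_size_mb:<value>`."""
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    return f"max_split_size_mb:{max_split_size_mb}"


def apply_serving_env(
    env: Optional[MutableMapping] = None, max_split_size_mb: int = 256
) -> MutableMapping:
    """Set the two variables from this guide on `env` (default: os.environ)."""
    target = os.environ if env is None else env
    target["VLLM_USE_MODELSCOPE"] = "True"
    target["PYTORCH_NPU_ALLOC_CONF"] = format_alloc_conf(max_split_size_mb)
    return target
```

Passing a plain dict as `env` makes it possible to build the environment for a subprocess without touching the current process.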

Deployment

Online Serving

For image inputs, run the following command inside the container to start the vLLM server on multiple NPUs:

vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--limit-mm-per-prompt.video 0 \
--max-model-len 128000

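Note

vllm-ascend supports expert parallelism (EP) via --enable-expert-parallel, which allows the experts of an MoE model to be placed on separate devices (NPUs on Ascend hardware) for better throughput.

If your inference server only handles image inputs, it is strongly recommended to specify --limit-mm-per-prompt.video 0, because enabling video input consumes additional memory that is reserved for long video embeddings.

You can set --max-model-len to save memory. By default the model's context length is 262K, but --max-model-len 128000 is sufficient for most scenarios.

If the service starts successfully, you will see output like the following: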

INFO:     Started server process [746077]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
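Loading the weights can take a while, so scripts that start the server usually need to wait for it rather than watch the log. The sketch below polls until a probe succeeds or a timeout expires. It assumes the server listens on localhost:8000 (the address used by the requests in this guide) and that the vLLM OpenAI-compatible server answers GET /health with HTTP 200 once it is ready. The helper names are hypothetical.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready".
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 600.0, interval_s: float = 2.0
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call is `wait_until_ready(http_probe)`. Taking the probe as a parameter keeps the polling loop independent of the network, which makes it easy to test.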

Once the server is running, you can query the model with an input prompt:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"}
        ]}
    ],
    "max_completion_tokens": 100
    }'
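The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl payload above field by field. The function names `build_image_request` and `post_chat` are hypothetical, and the base URL assumes the server runs on localhost:8000 as in the curl command.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen3-VL-30B-A3B-Instruct"


def build_image_request(
    image_url: str, question: str, model: str = MODEL, max_completion_tokens: int = 100
) -> dict:
    """Build a chat completion payload with one image and one text part."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "max_completion_tokens": max_completion_tokens,
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `post_chat(build_image_request("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png", "What is the text in the illustration?"))` sends the same query as the curl command.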

If the query succeeds, the client receives a response like the following:

{"id":"chatcmpl-974cb7a7a746a13e","object":"chat.completion","created":1766569357,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-30B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen\".","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":122,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
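Most of the response consists of null fields. For verification, the answer text, the finish reason, and the token usage are usually enough. The sketch below extracts them. The `SAMPLE` string is a trimmed copy of the response above, and the function name `summarize_response` is hypothetical.

```python
import json

# Trimmed copy of the sample response shown in this guide.
SAMPLE = (
    r'{"choices":[{"index":0,"message":{"role":"assistant",'
    r'"content":"The text in the illustration is \"TONGYI Qwen\"."},'
    r'"finish_reason":"stop"}],'
    r'"usage":{"prompt_tokens":107,"total_tokens":122,"completion_tokens":15}}'
)


def summarize_response(raw: str) -> dict:
    """Pull the answer text, finish reason, and token counts out of a response."""
    response = json.loads(raw)
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


if __name__ == "__main__":
    print(summarize_response(SAMPLE))
```

In the sample, the 107 prompt tokens and 15 completion tokens add up to the reported total of 122. A finish reason of "stop" indicates the model ended the answer on its own rather than hitting the max_completion_tokens limit.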

vLLM server log:

INFO 12-24 09:42:37 [acl_graph.py:187] Replaying aclgraph
INFO:     127.0.0.1:54946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-24 09:42:41 [loggers.py:257] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
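The engine statistics line packs several metrics into one comma-separated string. To track them over time, the line can be parsed into numbers. The sketch below assumes the line format stays as shown in the log above, which is not guaranteed across vLLM versions. The function name `parse_engine_stats` is hypothetical.

```python
import re

_ENGINE_RE = re.compile(r"Engine (\d+): (.*)$")
_NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_engine_stats(line: str) -> dict:
    """Parse an 'Engine NNN: name: value, ...' log line into {name: float}."""
    match = _ENGINE_RE.search(line.strip())
    if match is None:
        raise ValueError("not an engine statistics line")
    stats = {}
    for field in match.group(2).split(", "):
        name, _, value = field.partition(": ")
        number = _NUMBER_RE.match(value)
        if number:
            # Units such as 'tokens/s', 'reqs', and '%' are dropped.
            stats[name] = float(number.group())
    return stats


if __name__ == "__main__":
    sample = (
        "INFO 12-24 09:42:41 [loggers.py:257] Engine 000: "
        "Avg prompt throughput: 10.7 tokens/s, "
        "Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, "
        "Waiting: 0 reqs, GPU KV cache usage: 0.0%, "
        "Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%"
    )
    print(parse_engine_stats(sample))
```

Note that the log label reads "GPU KV cache usage" even though the server runs on NPUs.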

For video inputs, run the following command inside the container to start the vLLM server on multiple NPUs:

vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--max-model-len 128000 \
--allowed-local-media-path /media

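Note

vllm-ascend supports expert parallelism (EP) via --enable-expert-parallel, which allows the experts of an MoE model to be placed on separate devices (NPUs on Ascend hardware) for better throughput.

You can set --max-model-len to save memory. By default the model's context length is 262K, but --max-model-len 128000 is sufficient for most scenarios.

Set --allowed-local-media-path /media to use local videos stored under /media, because downloading videos directly while serving can be extremely slow due to network issues.

If the service starts successfully, you will see output like the following: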
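The two launch commands in this section differ in one respect: the image-only variant disables video with --limit-mm-per-prompt.video 0, while the video variant adds --allowed-local-media-path. The sketch below captures that choice in one place. It uses only flags that appear in this guide. The function name `build_serve_args` and its parameters are hypothetical.

```python
import shlex


def build_serve_args(
    model: str = "Qwen/Qwen3-VL-30B-A3B-Instruct",
    tensor_parallel_size: int = 2,
    max_model_len: int = 128000,
    video: bool = False,
    media_path: str = "/media",
) -> list[str]:
    """Assemble the `vllm serve` argument list for image-only or video serving."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enable-expert-parallel",
        "--max-model-len", str(max_model_len),
    ]
    if video:
        # Video variant: allow reading local files under media_path.
        args += ["--allowed-local-media-path", media_path]
    else:
        # Image-only variant: skip the memory reserved for video embeddings.
        args += ["--limit-mm-per-prompt.video", "0"]
    return args


if __name__ == "__main__":
    print(shlex.join(build_serve_args(video=False)))
    print(shlex.join(build_serve_args(video=True)))
```

The returned list can be passed directly to `subprocess.Popen` inside the container.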

INFO:     Started server process [746077]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Once the server is running, you can query the model with an input prompt:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "video_url", "video_url": {"url": "file:///media/test.mp4"}},
            {"type": "text", "text": "What is in this video?"}
        ]}
    ],
    "max_completion_tokens": 100
    }'
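The video request refers to the file through a file:// URL, and the server only accepts paths under the directory given to --allowed-local-media-path. A client-side check can catch mistakes before the request is sent. The sketch below is such a check under the assumption that /media is the allowed directory, as in this guide. It normalizes `..` segments but does not resolve symlinks, and it does not replace the server's own validation. The function names are hypothetical.

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import quote

MODEL = "Qwen/Qwen3-VL-30B-A3B-Instruct"


def local_video_url(path: str, allowed_root: str = "/media") -> str:
    """Return a file:// URL for `path`, or raise if it lies outside `allowed_root`."""
    normalized = PurePosixPath(posixpath.normpath(path))
    root = PurePosixPath(posixpath.normpath(allowed_root))
    if not normalized.is_absolute():
        raise ValueError(f"{path!r} is not an absolute path")
    try:
        normalized.relative_to(root)
    except ValueError:
        raise ValueError(f"{path!r} is outside {allowed_root!r}") from None
    # Percent-encode special characters; '/' is kept as the path separator.
    return "file://" + quote(str(normalized))


def build_video_request(
    video_path: str, question: str, model: str = MODEL, max_completion_tokens: int = 100
) -> dict:
    """Build a chat completion payload with one local video and one text part."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": local_video_url(video_path)}},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "max_completion_tokens": max_completion_tokens,
    }
```

Calling `build_video_request("/media/test.mp4", "What is in this video?")` produces the same payload as the curl command above.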

If the query succeeds, the client receives a response like the following:

{"id":"chatcmpl-a03c6d6e40267738","object":"chat.completion","created":1766569752,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-30B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The video shows a standard test pattern, which is a series of vertical bars in various colors (red, green, blue, yellow, magenta, cyan, and white) arranged in a circular pattern on a black background. This is a common visual used in television broadcasting to calibrate and test equipment. The pattern remains static throughout the video.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":196,"total_tokens":266,"completion_tokens":70,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

vLLM server log:

INFO:     127.0.0.1:49314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-24 09:49:22 [loggers.py:257] Engine 000: Avg prompt throughput: 19.6 tokens/s, Avg generation throughput: 7.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%

Offline Inference

Offline inference with Qwen3-VL-30B-A3B-Instruct works exactly the same way as with Qwen3-VL-8B-Instruct. For details, see the Qwen3-VL-8B-Instruct tutorial.