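Qwen3-VL-30B-A3B-Instruct
Introduction
Alibaba Cloud's Qwen-VL (Vision-Language) series comprises powerful large vision-language models (LVLMs) designed for comprehensive multimodal understanding. They accept images, text, and bounding boxes as input and produce text and detection boxes as output, enabling advanced capabilities such as image detection, multimodal dialogue, and multi-image reasoning.
This document walks through the main verification steps for Qwen3-VL-30B-A3B-Instruct.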
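Supported Features
Environment Preparation
Prepare Model Weights
Running this model requires one Atlas 800I A2 (64G × 8) node or one Atlas 800 A3 (64G × 16) node.
Download the model weights from the ModelScope website, or use the following commands: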
pip install modelscope
modelscope download --model Qwen/Qwen3-VL-30B-A3B-Instruct
It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/.
Installation
Run the Docker container:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data:/data \
-v <path/to/your/media>:/media \
-it $IMAGE bash
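The command above exposes two NPUs (`/dev/davinci0` and `/dev/davinci1`), which matches the `--tensor-parallel-size 2` setting used later. If you need a different number of devices, the flags can be generated rather than typed by hand. The following is a minimal sketch, assuming NPUs follow the `/dev/davinci<N>` naming shown above; the helper name `npu_device_flags` is hypothetical.

```python
def npu_device_flags(npu_ids):
    """Return docker `--device` flags for the given NPU indices.

    Assumes NPUs are exposed as /dev/davinci<N>, as in the command above.
    """
    flags = []
    for npu_id in npu_ids:
        flags += ["--device", f"/dev/davinci{npu_id}"]
    # Management devices listed in the docker command above; they are
    # passed once regardless of how many NPUs are selected.
    for shared in ("/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"):
        flags += ["--device", shared]
    return flags
```

For example, `" ".join(npu_device_flags(range(2)))` reproduces the five `--device` arguments of the command above.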
Set the environment variables:
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
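Note
max_split_size_mb prevents the native allocator from splitting memory blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. See the Ascend PyTorch documentation on PYTORCH_NPU_ALLOC_CONF for more details.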
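When the model is launched from a Python script instead of a shell, the same two variables can be set in-process. This is a sketch only; it assumes the variables must be in place before the frameworks that read them are imported, and the helper name `apply_ascend_env` is hypothetical.

```python
import os


def apply_ascend_env(max_split_size_mb=256, use_modelscope=True):
    """Set the two environment variables from this section in the current process."""
    # Same values as the `export` lines above.
    os.environ["VLLM_USE_MODELSCOPE"] = "True" if use_modelscope else "False"
    os.environ["PYTORCH_NPU_ALLOC_CONF"] = f"max_split_size_mb:{max_split_size_mb}"
    return {
        name: os.environ[name]
        for name in ("VLLM_USE_MODELSCOPE", "PYTORCH_NPU_ALLOC_CONF")
    }
```

Call `apply_ascend_env()` at the top of the script, before any import of vLLM.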
Deployment
Online Serving
To serve image inputs, run the following command inside the container to start the vLLM server on multiple NPUs:
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--limit-mm-per-prompt.video 0 \
--max-model-len 128000
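Note
vllm-ascend supports Expert Parallelism (EP) via --enable-expert-parallel, which allows the experts of an MoE model to be placed on separate devices (NPUs on Ascend) for better throughput.
If your inference server only handles image inputs, it is strongly recommended to specify --limit-mm-per-prompt.video 0, because enabling video inputs consumes additional memory reserved for long video embeddings.
You can set --max-model-len to save memory. By default the model's context length is 262K, but --max-model-len 128000 is sufficient for most scenarios.
If the service starts successfully, you will see output like the following: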
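The options discussed in the note can also be assembled programmatically, for example when the server is started from a launcher script. The sketch below uses only the flags shown in the command above; the function name `build_serve_command` and its parameters are hypothetical.

```python
def build_serve_command(model, tensor_parallel_size=2, max_model_len=128000,
                        image_only=True):
    """Build the `vllm serve` argument list used in this section."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enable-expert-parallel",
        "--max-model-len", str(max_model_len),
    ]
    if image_only:
        # Disable video inputs so no memory is reserved for video embeddings.
        cmd += ["--limit-mm-per-prompt.video", "0"]
    return cmd
```

The resulting list can be passed to `subprocess.run` or joined with `shlex.join` for logging.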
INFO: Started server process [746077]
INFO: Waiting for application startup.
INFO: Application startup complete.
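Loading a model of this size can take a while, so scripts that query the server usually wait for it to become ready instead of watching the log. A minimal polling sketch follows. It assumes the server listens on port 8000 (as in the curl examples in this document) and exposes a `/health` endpoint; both helper names are hypothetical.

```python
import time
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0):
    """Call `probe()` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:8000/health"))`. Passing the probe as a callable keeps the waiting logic independent of the endpoint being checked.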
Once the server is running, you can query the model with an input prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
],
"max_completion_tokens": 100
}'
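The same request can be sent from Python using only the standard library. This sketch mirrors the curl payload above; the function names are hypothetical and the base URL assumes the server is reachable at `http://localhost:8000`.

```python
import json
import urllib.request


def build_image_request(model, image_url, question, max_completion_tokens=100):
    """Build a chat-completions payload with one image and one text part."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ]},
        ],
        "max_completion_tokens": max_completion_tokens,
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat-completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `post_chat(build_image_request(...))` with the model name, image URL, and question from the curl example should produce an equivalent request.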
If the query succeeds, the client receives a response like the following:
{"id":"chatcmpl-974cb7a7a746a13e","object":"chat.completion","created":1766569357,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-30B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen\".","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":122,"completion_tokens":15,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
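Most of the fields in this response are null; the parts that usually matter are the generated text, the finish reason, and the token counts. A small sketch for extracting them is shown below; the helper name `summarize_completion` is hypothetical.

```python
import json


def summarize_completion(raw_json):
    """Extract the answer text, finish reason, and token usage from a response body."""
    body = json.loads(raw_json)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "text": choice["message"]["content"],
        # "stop" means the model ended on its own; "length" means the
        # max_completion_tokens limit cut the answer short.
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }
```

In the sample above, 107 prompt tokens plus 15 completion tokens account for the 122 total tokens reported.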
Logs from the vLLM server:
INFO 12-24 09:42:37 [acl_graph.py:187] Replaying aclgraph
INFO: 127.0.0.1:54946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-24 09:42:41 [loggers.py:257] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
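The throughput figures in this log line can be pulled out for monitoring. The sketch below is written against the sample line above only; the log format may differ between versions, and the helper name `parse_engine_metrics` is hypothetical.

```python
import re

# Pattern derived from the sample log line in this document.
_METRIC_PATTERN = re.compile(
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<generation_tps>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs"
)


def parse_engine_metrics(line):
    """Return the throughput and queue counters from a log line, or None."""
    match = _METRIC_PATTERN.search(line)
    if match is None:
        return None
    return {
        "prompt_tps": float(match["prompt_tps"]),
        "generation_tps": float(match["generation_tps"]),
        "running": int(match["running"]),
        "waiting": int(match["waiting"]),
    }
```

Note that these are averages over a logging window, not per-request speeds. The sample values appear consistent with a window of roughly 10 seconds (107 prompt tokens and 15 generated tokens give 10.7 and 1.5 tokens/s), so a single short request yields low numbers even when generation itself is fast.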
To serve video inputs, run the following command inside the container to start the vLLM server on multiple NPUs:
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--max-model-len 128000 \
--allowed-local-media-path /media
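Note
vllm-ascend supports Expert Parallelism (EP) via --enable-expert-parallel, which allows the experts of an MoE model to be placed on separate devices (NPUs on Ascend) for better throughput.
You can set --max-model-len to save memory. By default the model's context length is 262K, but --max-model-len 128000 is sufficient for most scenarios.
Set --allowed-local-media-path /media to use local videos located under /media, because downloading videos during serving can be extremely slow when the network is unreliable.
If the service starts successfully, you will see output like the following: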
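Requests then refer to local videos by `file://` URLs whose paths are the ones seen inside the container, that is, under the `/media` mount from the docker command. The sketch below builds such a URL and rejects paths outside the allowed directory before a request is sent. It is a client-side convenience only and assumes POSIX paths; the helper name `to_local_media_url` is hypothetical.

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import quote


def to_local_media_url(path, allowed_root="/media"):
    """Return a file:// URL for `path`, which must lie under `allowed_root`."""
    # Collapse ".." and duplicate slashes before checking containment.
    normalized = PurePosixPath(posixpath.normpath(path))
    root = PurePosixPath(posixpath.normpath(allowed_root))
    if not normalized.is_absolute():
        raise ValueError(f"{path!r} is not an absolute path")
    try:
        normalized.relative_to(root)
    except ValueError:
        raise ValueError(f"{path!r} is outside {allowed_root!r}") from None
    # Percent-encode characters that are not valid in a URL path.
    return "file://" + quote(str(normalized))
```

For example, `to_local_media_url("/media/test.mp4")` returns `file:///media/test.mp4`, while a path such as `/media/../etc/passwd` raises `ValueError`.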
INFO: Started server process [746077]
INFO: Waiting for application startup.
INFO: Application startup complete.
Once the server is running, you can query the model with an input prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "video_url", "video_url": {"url": "file:///media/test.mp4"}},
{"type": "text", "text": "What is in this video?"}
]}
],
"max_completion_tokens": 100
}'
If the query succeeds, the client receives a response like the following:
{"id":"chatcmpl-a03c6d6e40267738","object":"chat.completion","created":1766569752,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-30B-A3B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The video shows a standard test pattern, which is a series of vertical bars in various colors (red, green, blue, yellow, magenta, cyan, and white) arranged in a circular pattern on a black background. This is a common visual used in television broadcasting to calibrate and test equipment. The pattern remains static throughout the video.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":196,"total_tokens":266,"completion_tokens":70,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Logs from the vLLM server:
INFO: 127.0.0.1:49314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-24 09:49:22 [loggers.py:257] Engine 000: Avg prompt throughput: 19.6 tokens/s, Avg generation throughput: 7.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 33.3%
Offline Inference
Offline inference with Qwen3-VL-30B-A3B-Instruct works exactly as it does with Qwen3-VL-8B-Instruct; see the Qwen3-VL-8B-Instruct tutorial for details.