Qwen3-VL-Embedding#

Introduction#

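The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest members of the Qwen family, built on the recently open-sourced Qwen3-VL foundation model. The series is designed for multimodal information retrieval and cross-modal understanding. It accepts diverse inputs, including text, images, screenshots, and video, as well as inputs that mix these modalities. This guide describes how to run the model with vLLM Ascend.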

Supported Features#

Refer to the supported features page for the model's feature support matrix.

Environment Preparation#

Model Weights#

It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/.

Installation#

You can use our official docker image to run the Qwen3-VL-Embedding series models.

  • To start the docker image on your node, refer to the "using docker" guide.

If you do not want to use the docker image above, you can also build everything from source:

Deployment#

Taking the Qwen3-VL-Embedding-8B model as an example, first run the docker container with the following command:

Online Inference#

vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling

Once the server is up, you can query the model with input prompts.

curl -X POST http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
  "input": [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
}'
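The same request can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the server is reachable at the URL from the curl example and that the response follows the OpenAI-style embeddings layout (a `data` list whose items carry `index` and `embedding`). The helper names `build_request`, `extract_embeddings`, and `fetch_embeddings` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

# Assumption: same host and port as the curl example above.
EMBEDDINGS_URL = "http://localhost:8000/v1/embeddings"


def build_request(texts, url=EMBEDDINGS_URL):
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_embeddings(response_json):
    """Return the embedding vectors ordered by their `index` field.

    Assumes an OpenAI-style response: {"data": [{"index": 0, "embedding": [...]}, ...]}.
    """
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def fetch_embeddings(texts, url=EMBEDDINGS_URL):
    """Send the texts to the running server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts, url)) as response:
        return extract_embeddings(json.load(response))
```

With the server running, `fetch_embeddings(["The capital of China is Beijing."])` should return a list containing one embedding vector.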

Offline Inference#

import torch
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


if __name__=="__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
                runner="pooling",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())

If the script runs successfully, you will see output like the following:

Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 192.47it/s]
Processed prompts:   0%|                                            | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=2425173) (Worker pid=2425180) INFO 01-09 00:44:40 [acl_graph.py:194] Replaying aclgraph
(EngineCore_DP0 pid=2425173) (Worker pid=2425180) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████| 4/4 [00:00<00:00, 21.34it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.9279120564460754, 0.32747742533683777], [0.4124627113342285, 0.7425257563591003]]
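The last line of the output is the query-by-document score matrix: row i holds the scores of query i against each document. As a small illustration of how it might be read, the sketch below picks the highest-scoring document for each query. The scores are copied from the sample output, and the helper `best_document_per_query` is a hypothetical name introduced here.

```python
# Scores copied from the sample output (rows: queries, columns: documents).
scores = [
    [0.9279120564460754, 0.32747742533683777],
    [0.4124627113342285, 0.7425257563591003],
]


def best_document_per_query(score_matrix):
    """Return, for each query row, the index of the highest-scoring document."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_matrix]


best = best_document_per_query(scores)
print(best)  # prints [0, 1]
```

Each query scores highest against its own document: the capital question matches the Beijing sentence, and the gravity question matches the gravity passage.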

For more examples, refer to the official vLLM examples:

Performance#

This section uses the runtime performance of Qwen3-VL-Embedding-8B as an example. For more details, refer to the vLLM benchmark documentation.

Take the serve benchmark as an example. Run the command as follows.

vllm bench serve --model Qwen/Qwen3-VL-Embedding-8B --backend openai-embeddings --dataset-name random --endpoint /v1/embeddings --random-input 200 --save-result --result-dir ./

After a few minutes, you will get the performance evaluation results. In this tutorial, the results are as follows:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  19.53
Total input tokens:                      200000
Request throughput (req/s):              51.20
Total token throughput (tok/s):          10240.42
----------------End-to-end Latency----------------
Mean E2EL (ms):                          10360.53
Median E2EL (ms):                        10354.37
P99 E2EL (ms):                           19423.21
==================================================
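As a quick sanity check on how these figures relate, the throughput lines can be approximately reproduced from the request count, token count, and duration. The sketch below uses only the numbers reported above. Small differences are expected because the duration is shown rounded to two decimals.

```python
# Figures copied from the benchmark result above.
successful_requests = 1000
duration_s = 19.53
total_input_tokens = 200000

# Each request carries 200 input tokens, matching `--random-input 200`.
tokens_per_request = total_input_tokens // successful_requests

# Throughput is the total count divided by the benchmark duration.
request_throughput = successful_requests / duration_s
token_throughput = total_input_tokens / duration_s

print(tokens_per_request)
print(f"{request_throughput:.2f} req/s")
print(f"{token_throughput:.2f} tok/s")
```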