Qwen3-Embedding

Introduction

The Qwen3 Embedding model series is the latest proprietary model series of the Qwen family, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it offers text embedding and reranking models in three sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Only vLLM Ascend 0.9.2rc1 and later versions support the model.

Supported Features

See Supported Features for the feature support matrix of this model.

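Environment Preparation

Model Weights

Qwen3-Embedding-8B: download the model weights
Qwen3-Embedding-4B: download the model weights
Qwen3-Embedding-0.6B: download the model weights

We recommend downloading the model weights to a directory shared across nodes, such as /root/.cache/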

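Installation

You can run the Qwen3-Embedding series models with our official docker image.

To start the docker image on your node, see Using Docker.

If you prefer not to use the docker image above, you can also build everything from source:

To install vllm-ascend from source, see the installation guide.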

Deployment

Taking the Qwen3-Embedding-8B model as an example, first start the docker container as described in Using Docker.

Online Inference

vllm serve Qwen/Qwen3-Embedding-8B --runner pooling --host 127.0.0.1 --port 8888

Once the server is up, you can query the model with input prompts.

curl http://localhost:8888/v1/embeddings -H "Content-Type: application/json" -d '{
    "input": [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
}'
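
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on localhost:8888 and that the response follows the OpenAI-style embeddings schema (a `data` list whose items carry `index` and `embedding`). The helper names `build_request`, `parse_embeddings`, and `embed` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumption: the server from the `vllm serve` command above is reachable here.
ENDPOINT = "http://localhost:8888/v1/embeddings"


def build_request(texts, url=ENDPOINT):
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json):
    """Return embeddings ordered by their `index` field (OpenAI-style schema assumed)."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, url=ENDPOINT, timeout=60):
    """Send the texts to the server and return one embedding per input."""
    with urllib.request.urlopen(build_request(texts, url), timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))
```

With the server running, calling `embed(["The capital of China is Beijing."])` should return a list containing one vector.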

Offline Inference

import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    """Prefix a query with the instruction that describes its task."""
    return f'Instruct: {task_description}\nQuery:{query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'

    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-Embedding-8B",
                distributed_executor_backend="mp")

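    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    # Rows are the two queries, columns are the two documents:
    # scores[i][j] is the dot product of query i and document j.
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())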

If the script runs successfully, you will see output like the following:

Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts:   0%|                                                                                                                                    | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
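
Each row of the printed matrix corresponds to a query and each column to a document, so the largest value in a row points to the best-matching document for that query. A minimal sketch in plain Python, using the scores from the sample output above (the helper name `best_match` is illustrative):

```python
# Score matrix copied from the sample output above:
# rows are queries, columns are documents.
scores = [
    [0.7477798461914062, 0.07548339664936066],
    [0.0886271521449089, 0.6311039924621582],
]


def best_match(score_rows):
    """Return, for each query row, the index of the highest-scoring document."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


print(best_match(scores))  # -> [0, 1]
```

Here the query about the capital of China matches the first document and the query about gravity matches the second, as expected.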

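Performance

This section uses Qwen3-Embedding-8B as the example. For more details, see the vllm benchmark documentation.

Taking serve as an example, run the benchmark as follows.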

vllm bench serve --model Qwen3-Embedding-8B --backend openai-embeddings --dataset-name random --host 127.0.0.1 --port 8888 --endpoint /v1/embeddings --tokenizer /root/.cache/Qwen3-Embedding-8B --random-input 200 --save-result --result-dir ./

After a few minutes, you will get the performance evaluation results. Following this tutorial, the results look like this:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  6.78
Total input tokens:                      108032
Request throughput (req/s):              31.11
Total Token throughput (tok/s):          15929.35
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4422.79
Median E2EL (ms):                        4412.58
P99 E2EL (ms):                           6294.52
==================================================
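
For post-processing, the text report can be turned into a dictionary of numbers. The following is a rough sketch that assumes the report was captured as a string. `parse_report` is a hypothetical helper, not a vLLM API. As a sanity check it recomputes the token throughput from the total input tokens and the benchmark duration, which should agree with the reported value up to rounding of the duration.

```python
import re

# Sample report lines copied from the benchmark output above.
REPORT = """\
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  6.78
Total input tokens:                      108032
Request throughput (req/s):              31.11
Total Token throughput (tok/s):          15929.35
Mean E2EL (ms):                          4422.79
Median E2EL (ms):                        4412.58
P99 E2EL (ms):                           6294.52
"""


def parse_report(text):
    """Map each 'label: number' line of the report to a float."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+([0-9.]+)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


metrics = parse_report(REPORT)
# Token throughput recomputed from the token count and the (rounded) duration.
derived = metrics["Total input tokens"] / metrics["Benchmark duration (s)"]
print(f"reported: {metrics['Total Token throughput (tok/s)']:.2f} tok/s")
print(f"derived:  {derived:.2f} tok/s")
```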