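Qwen3-Embedding
Introduction
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides a full range of text embedding and reranking models in three sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only vLLM Ascend 0.9.2rc1 and later versions support this model.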
Supported Features
Refer to Supported Features for the feature support matrix of this model.
Environment Preparation
Model Weights
It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/
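Installation
You can use our official docker image to run the Qwen3-Embedding series models.
To start the docker image on your node, refer to Using docker.
If you do not want to use the docker image above, you can also build everything from source:
To install vllm-ascend from source, refer to the installation guide.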
Deployment
Taking the Qwen3-Embedding-8B model as an example, first run the docker container with the following command:
Online Inference
vllm serve Qwen/Qwen3-Embedding-8B --runner pooling --host 127.0.0.1 --port 8888
Once the server is up, you can query the model with input prompts.
curl http://localhost:8888/v1/embeddings -H "Content-Type: application/json" -d '{
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
}'
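The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started by the vllm serve command above is listening on localhost:8888 and that the response follows the OpenAI embeddings layout (a data list whose items carry index and embedding fields). The helper names build_request, parse_embeddings, and embed are hypothetical and are not part of vLLM.

```python
import json
import urllib.request

# Assumption: the server from the `vllm serve` command above is listening here.
EMBEDDINGS_URL = "http://localhost:8888/v1/embeddings"


def build_request(texts, url=EMBEDDINGS_URL):
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response):
    """Return the embedding vectors from a decoded response, in input order.

    Assumes the OpenAI-style layout: {"data": [{"index": 0, "embedding": [...]}, ...]}.
    """
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, url=EMBEDDINGS_URL, timeout=60):
    """Send the texts to the server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts, url), timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))
```

With the server running, embed(["The capital of China is Beijing."]) would return a list containing one embedding vector. Nothing is sent over the network until embed is called.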
Offline Inference
import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'
    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-Embedding-8B",
                distributed_executor_backend="mp")

    # One embedding per input text, in the same order as input_texts
    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    # Rows are the two queries, columns are the two documents
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())
If the script runs successfully, you will see output like the following:
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
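The last line of the output is the query-by-document score matrix: each row corresponds to one query and each column to one document. As a rough illustration of how to read it, the sketch below computes cosine similarity in plain Python and picks the best-scoring document per query. It normalizes the vectors explicitly, so it does not rely on the embeddings already being unit length (the script above computes a plain dot product, which equals cosine similarity only under that assumption). The helper names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_scores(query_vectors, document_vectors):
    """Return a matrix with one row per query and one column per document."""
    queries = [l2_normalize(v) for v in query_vectors]
    documents = [l2_normalize(v) for v in document_vectors]
    return [
        [sum(q * d for q, d in zip(qv, dv)) for dv in documents]
        for qv in queries
    ]


def best_document(scores):
    """For each query row, return the column index of the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Score matrix printed by the script above (rows: queries, columns: documents).
printed_scores = [
    [0.7477798461914062, 0.07548339664936066],
    [0.0886271521449089, 0.6311039924621582],
]
print(best_document(printed_scores))  # prints [0, 1]
```

In this output, the query about the capital of China scores highest against the first document, and the gravity query against the second, which is the expected pairing.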
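Performance
This section uses Qwen3-Embedding-8B as the performance example. For more details, refer to the vllm benchmark documentation.
Taking the serve benchmark as an example, run the following command: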
vllm bench serve --model Qwen3-Embedding-8B --backend openai-embeddings --dataset-name random --host 127.0.0.1 --port 8888 --endpoint /v1/embeddings --tokenizer /root/.cache/Qwen3-Embedding-8B --random-input 200 --save-result --result-dir ./
After a few minutes, the benchmark prints its performance results. Following this tutorial, the results are:
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 6.78
Total input tokens: 108032
Request throughput (req/s): 31.11
Total Token throughput (tok/s): 15929.35
----------------End-to-end Latency----------------
Mean E2EL (ms): 4422.79
Median E2EL (ms): 4412.58
P99 E2EL (ms): 6294.52
==================================================
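To compare several runs, it can help to turn the printed report into a dictionary. The sketch below is a hypothetical helper, not part of vLLM, and it assumes only the "name: value" line layout shown above. It parses the console text rather than the file written by --save-result, whose format is not described here.

```python
def parse_benchmark_report(text):
    """Parse "name: value" lines of the printed report into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # banner and separator rows contain no colon
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # skip lines whose value is not numeric
    return metrics


# Excerpt of the report shown above.
report = """\
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 6.78
----------------End-to-end Latency----------------
Mean E2EL (ms): 4422.79
P99 E2EL (ms): 6294.52
==================================================
"""

metrics = parse_benchmark_report(report)
print(metrics["Mean E2EL (ms)"], metrics["P99 E2EL (ms)"])
```

Keys are the metric names exactly as printed, including the unit suffix, so results from different runs can be compared by key.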