Qwen3-VL-Reranker#
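Introduction#
The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built on the recently open-sourced Qwen3-VL foundation model. The series is designed for multimodal information retrieval and cross-modal understanding. It accepts a variety of inputs, including text, images, screenshots, and video, as well as inputs that mix these modalities. This guide describes how to run the model with vLLM Ascend.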
Supported Features#
Refer to the supported features matrix for the list of features this model supports.
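Environment Preparation#
Model Weights#
It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/.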
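Installation#
You can run the Qwen3-VL-Reranker series models with the official Docker image.
To start the Docker image on your node, refer to Installing with Docker.
If you prefer not to use the Docker image above, you can also build from source:
To install vllm-ascend from source, refer to the installation guide.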
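Deployment#
The following steps use the Qwen3-VL-Reranker-8B model as an example:
Chat Template#
The Qwen3-VL-Reranker model needs a specific chat template to format its input correctly. Create a file named qwen3_vl_reranker.jinja with the following content: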
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: {{
messages
| selectattr("role", "eq", "system")
| map(attribute="content")
| first
| default("Given a search query, retrieve relevant candidates that answer the query.")
}}<Query>:{{
messages
| selectattr("role", "eq", "query")
| map(attribute="content")
| first
}}
<Document>:{{
messages
| selectattr("role", "eq", "document")
| map(attribute="content")
| first
}}<|im_end|>
<|im_start|>assistant
Save this file to a location of your choice (for example, ./qwen3_vl_reranker.jinja).
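To make the template's selection logic easier to follow, here is a minimal pure-Python sketch that mirrors it: it picks the first message of each role (`system`, `query`, `document`) and falls back to the default instruction when no `system` message is present. `first_content` and `render_rerank_prompt` are hypothetical helpers written for illustration, not part of vLLM. The actual rendering is done by Jinja, and trailing-whitespace handling in particular may differ from this sketch.

```python
DEFAULT_INSTRUCTION = (
    "Given a search query, retrieve relevant candidates that answer the query."
)

SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on the Query "
    'and the Instruct provided. Note that the answer can only be "yes" or "no".'
)


def first_content(messages, role, default=None):
    """Return the content of the first message with the given role."""
    for message in messages:
        if message["role"] == role:
            return message["content"]
    return default


def render_rerank_prompt(messages):
    """Hypothetical pure-Python mirror of qwen3_vl_reranker.jinja."""
    # A missing "system" message falls back to the default instruction,
    # like the `default(...)` filter in the template.
    instruction = first_content(messages, "system", DEFAULT_INSTRUCTION)
    # Assumption: a missing query or document renders as an empty string.
    query = first_content(messages, "query", "")
    document = first_content(messages, "document", "")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n"
        f"<Instruct>: {instruction}<Query>:{query}\n"
        f"<Document>:{document}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(render_rerank_prompt([
        {"role": "query", "content": "What is the capital of China?"},
        {"role": "document", "content": "The capital of China is Beijing."},
    ]))
```

Printing the rendered prompt for one query and document pair is a quick way to check that the roles your client sends line up with what the template expects.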
Online Inference#
Start the server with the following command:
vllm serve Qwen/Qwen3-VL-Reranker-8B \
--runner pooling \
--max-model-len 4096 \
--hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
--chat-template ./qwen3_vl_reranker.jinja
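The JSON passed to `--hf_overrides` is easy to break with shell quoting. As a minimal sketch, the same command line can be assembled from a Python dict so that the JSON is always well formed. The variable names `hf_overrides`, `command`, and `shell_line` are illustrative only; the flags and values are copied from the command above.

```python
import json
import shlex

# Same overrides as in the serve command, expressed as a Python dict.
hf_overrides = {
    "architectures": ["Qwen3VLForSequenceClassification"],
    "classifier_from_token": ["no", "yes"],
    "is_original_qwen3_reranker": True,
}

command = [
    "vllm", "serve", "Qwen/Qwen3-VL-Reranker-8B",
    "--runner", "pooling",
    "--max-model-len", "4096",
    # json.dumps turns Python's True into JSON's true and escapes quotes.
    "--hf_overrides", json.dumps(hf_overrides),
    "--chat-template", "./qwen3_vl_reranker.jinja",
]

# shlex.join quotes each argument so the JSON survives the shell intact.
shell_line = shlex.join(command)
print(shell_line)
```

The printed line can be pasted into a shell, or `command` can be passed directly to `subprocess.run` without going through a shell at all.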
Once the server is running, you can send requests as in the following example.
import requests

url = "http://127.0.0.1:8000/v1/rerank"

# Please use the query_template and document_template to format the query and
# document for better reranker results.
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n"
query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

instruction = (
    "Given a search query, retrieve relevant candidates that answer the query."
)
query = "What is the capital of China?"
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
documents = [
    document_template.format(doc=doc, suffix=suffix) for doc in documents
]

response = requests.post(
    url,
    json={
        "query": query_template.format(prefix=prefix, instruction=instruction, query=query),
        "documents": documents,
    },
).json()
print(response)
If the script runs successfully, the console shows a response containing relevance scores similar to the following:
{'id': 'rerank-ac3495afa8e12404', 'model': 'Qwen/Qwen3-VL-Reranker-8B', 'usage': {'prompt_tokens': 315, 'total_tokens': 315}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.6368980407714844}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.20816077291965485}]}
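The response lists one entry per document with an `index` and a `relevance_score`. A minimal sketch for turning such a response into a ranking follows. `rank_documents` is a hypothetical helper, and `sample` is a trimmed copy of the output above, reordered and keeping only the fields the helper reads.

```python
def rank_documents(response):
    """Return (index, score) pairs sorted from most to least relevant."""
    ranked = sorted(
        response["results"],
        key=lambda result: result["relevance_score"],
        reverse=True,
    )
    return [(result["index"], result["relevance_score"]) for result in ranked]


# Trimmed sample based on the response shown above.
sample = {
    "results": [
        {"index": 1, "relevance_score": 0.20816077291965485},
        {"index": 0, "relevance_score": 0.6368980407714844},
    ]
}

for index, score in rank_documents(sample):
    print(f"document {index}: {score:.4f}")
```

The `index` field refers to the position of the document in the request, so it can be used to look up the original, unformatted text.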
Offline Inference#
from vllm import LLM
model_name = "Qwen/Qwen3-VL-Reranker-8B"
# What is the difference between the official original version and one
# that has been converted into a sequence classification model?
# Qwen3-VL-Reranker is a language model that performs reranking by using the
# logits of the "no" and "yes" tokens.
# The original version has to compute logits for all 151669 tokens, which is
# extremely inefficient and incompatible with the vLLM score API.
# A method for converting the original model into a sequence classification
# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline using this method are more efficient, support the
# vLLM score API, and need fewer init parameters, for example:
# model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")
# If you want to load the official original version, the init parameters are
# as follows.
model = LLM(
    model=model_name,
    runner="pooling",
    hf_overrides={
        # Manually route to sequence classification architecture
        # This tells vLLM to use Qwen3VLForSequenceClassification instead of
        # the default Qwen3VLForConditionalGeneration
        "architectures": ["Qwen3VLForSequenceClassification"],
        # Specify which token logits to extract from the language model head
        # The original reranker uses "no" and "yes" token logits for scoring
        "classifier_from_token": ["no", "yes"],
        # Enable special handling for original Qwen3-Reranker models
        # This flag triggers conversion logic that transforms the two token
        # vectors into a single classification vector
        "is_original_qwen3_reranker": True,
    },
)
# Why do we need hf_overrides for the official original version?
# vLLM converts it to Qwen3VLForSequenceClassification when loaded for
# better performance.
# - First, we use `"architectures": ["Qwen3VLForSequenceClassification"]`
#   to manually route to Qwen3VLForSequenceClassification.
# - Then, we extract the vectors corresponding to classifier_from_token
#   from lm_head using `"classifier_from_token": ["no", "yes"]`.
# - Third, we convert these two vectors into one vector. This conversion
#   logic is enabled by `"is_original_qwen3_reranker": True`.
# Please use the query_template and document_template to format the query and
# document for better reranker results.
prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n"
query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"
if __name__ == "__main__":
    instruction = (
        "Given a search query, retrieve relevant candidates that answer the query."
    )
    query = "What is the capital of China?"
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    ]
    # Wrap each document in the document template before scoring.
    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]
    outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)
    print([output.outputs.score for output in outputs])
If the script runs successfully, the console shows a list of scores similar to the following:
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2409.83it/s]
Processed prompts: 0%| | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=682882) INFO 01-20 04:38:46 [acl_graph.py:188] Replaying aclgraph
Processed prompts: 100%|████████████████████████████████████| 2/2 [00:00<00:00, 9.44it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[0.7235596776008606, 0.0002742875076364726]
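The comments in the offline script mention that the two token vectors for "no" and "yes" are converted into a single classification vector. One way to see why a single vector can be enough: a two-way softmax over the "no" and "yes" logits equals a sigmoid of their difference, so the difference of the two weight vectors yields the same score. The sketch below checks this numerically. It assumes the conversion amounts to subtracting the "no" vector from the "yes" vector, which the source does not spell out, and all numbers are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_yes_probability(logit_no, logit_yes):
    """Probability of "yes" under a two-way softmax over (no, yes)."""
    peak = max(logit_no, logit_yes)  # subtract the max for numerical stability
    exp_no = math.exp(logit_no - peak)
    exp_yes = math.exp(logit_yes - peak)
    return exp_yes / (exp_no + exp_yes)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Made-up hidden state and lm_head rows for the "no" and "yes" tokens.
hidden = [0.5, -1.0, 2.0]
w_no = [0.1, 0.4, -0.3]
w_yes = [0.7, -0.2, 0.6]

# Assumed conversion: fold the two rows into one classification vector.
w_single = [yes - no for yes, no in zip(w_yes, w_no)]

two_way = softmax_yes_probability(dot(hidden, w_no), dot(hidden, w_yes))
one_way = sigmoid(dot(hidden, w_single))
print(two_way, one_way)
```

Under that assumption, the model needs one dot product per input instead of logits over the full vocabulary, which matches the efficiency argument made in the script's comments.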
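For more examples, refer to the official vLLM examples.
Performance#
This section runs a performance test with Qwen3-VL-Reranker-8B as an example. For more details, refer to the vLLM benchmark documentation.
Using the serve benchmark as an example, run the following command.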
vllm bench serve --model Qwen/Qwen3-VL-Reranker-8B --backend vllm-rerank --dataset-name random-rerank --endpoint /v1/rerank --random-input 200 --save-result --result-dir ./
After a few minutes, the benchmark results are available. Following this tutorial, the results are as follows:
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 13.70
Total input tokens: 265122
Request throughput (req/s): 72.99
Total token throughput (tok/s): 19351.23
----------------End-to-end Latency----------------
Mean E2EL (ms): 7474.64
Median E2EL (ms): 7528.72
P99 E2EL (ms): 13523.32
==================================================
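The throughput figures in the report can be cross-checked against the raw counts. The sketch below recomputes them from the numbers shown above. The dictionary keys and `derived_metrics` are illustrative names, not fields of the benchmark's saved result file, and small differences are expected because the reported duration is rounded.

```python
# Values copied from the benchmark report above.
reported = {
    "successful_requests": 1000,
    "duration_s": 13.70,
    "total_input_tokens": 265122,
    "request_throughput": 72.99,
    "token_throughput": 19351.23,
}


def derived_metrics(report):
    """Recompute throughput figures from the raw counts and duration."""
    requests_done = report["successful_requests"]
    duration = report["duration_s"]
    tokens = report["total_input_tokens"]
    return {
        "request_throughput": requests_done / duration,
        "token_throughput": tokens / duration,
        "mean_tokens_per_request": tokens / requests_done,
    }


derived = derived_metrics(reported)
for name in ("request_throughput", "token_throughput"):
    print(f"{name}: derived {derived[name]:.2f}, reported {reported[name]:.2f}")
print(f"mean tokens per request: {derived['mean_tokens_per_request']:.1f}")
```

A large gap between the derived and reported values would suggest that the report was copied incorrectly or that failed requests were included in the counts.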