Text Pooling¶
Metal V1 has experimental text-only embed pooling support for compatible
pooling models. Supported requests run as prefill-only work, return one CPU
L2-normalized embedding tensor per finished request through vLLM's
pooler_output contract, and do not sample generation tokens.
It also has experimental text-only classify support for original Qwen3
reranker checkpoints that vLLM converts with
Qwen3ForSequenceClassification, classifier_from_token=["no", "yes"], and
is_original_qwen3_reranker=True. This path returns one scalar score tensor
per request through the same pooler_output contract.
Scope¶
Current scope is intentionally narrow:
- text
embedrequests withrunner="pooling"and embedding-capable pooler configs (pooler_config.taskunset orpooler_config.task="embed") - original Qwen3 reranker
classifyrequests withQwen3ForSequenceClassification,classifier_from_token=["no", "yes"], andis_original_qwen3_reranker=True - decoder-style text models that expose token hidden states through the MLX transformer body
- sequence embeddings from the final prompt-token hidden state with LAST pooling and L2 normalization on the paged Metal V1 path
- Qwen3 reranker cross-encoder scores from the final prompt-token hidden state,
using
lm_headfor untied checkpoints orembed_tokens.as_linearwhen word embeddings are tied
Unsupported¶
The Metal runner rejects these cases with diagnostic errors:
- generic classification heads, generic reranking models, and late interaction
- sequence pooling strategies other than LAST (
MEAN,CLS,ALL,STEP) - token-level pooling
- chunked long-input embedding aggregation (
enable_chunked_processing) - non-paged pooling execution
- multimodal embeddings and scheduled encoder inputs
- prompt embeddings
- unsafe dimension requests
Direct model-provided embedding tensors are intentionally out of scope for this MVP. Add that path only after a real model requires it and the output contract is validated end to end.
Usage¶
Set VLLM_METAL_USE_PAGED_ATTENTION=1 for the current text pooling MVP.
Offline Embeddings¶
from vllm import LLM
llm = LLM(
model="mlx-community/Qwen3-Embedding-0.6B-8bit",
runner="pooling",
max_model_len=512,
)
outputs = llm.embed(["hello metal", "semantic search"])
print(len(outputs), len(outputs[0].outputs.embedding))
Embedding Server¶
VLLM_ENABLE_V1_MULTIPROCESSING=0 \
VLLM_METAL_USE_PAGED_ATTENTION=1 \
VLLM_METAL_MEMORY_FRACTION=auto \
vllm serve mlx-community/Qwen3-Embedding-0.6B-8bit \
--runner pooling \
--max-model-len 512
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model":"mlx-community/Qwen3-Embedding-0.6B-8bit","input":["hello metal","semantic search"]}'
Offline Qwen3 Reranking¶
Original Qwen3 reranker checkpoints need vLLM's sequence-classification
overrides. LLM.score can format the query/document pair for this checkpoint
without a separate local template file.
from vllm import LLM
llm = LLM(
model="mku64/Qwen3-Reranker-0.6B-mlx-8Bit",
revision="ba80418a47fa1c4368a6c2287b0e449904063576",
runner="pooling",
max_model_len=512,
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
"is_original_qwen3_reranker": True,
},
)
outputs = llm.score(
["What is the capital of China?"],
["The capital of China is Beijing."],
)
print(outputs[0].outputs.score)
Qwen3 Reranking Server¶
VLLM_ENABLE_V1_MULTIPROCESSING=0 \
VLLM_METAL_USE_PAGED_ATTENTION=1 \
VLLM_METAL_MEMORY_FRACTION=auto \
vllm serve mku64/Qwen3-Reranker-0.6B-mlx-8Bit \
--revision ba80418a47fa1c4368a6c2287b0e449904063576 \
--runner pooling \
--max-model-len 512 \
--hf-overrides '{
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
"is_original_qwen3_reranker": true
}'
curl http://localhost:8000/score \
-H "Content-Type: application/json" \
-d '{"text_1":["What is the capital of China?"],"text_2":["The capital of China is Beijing."]}'
Validation¶
Do not add a model row to Supported Models until a real
LLM.embed, /v1/embeddings, LLM.score, or /score smoke passes on Apple
Silicon with the model name, revision, command, and output shape recorded.