Skip to content

Supported Models

vllm-metal supports text language models and a small set of native multimodal models on Apple Silicon. Multimodal support is currently vision-only and runs on the paged backend.

Legend

Symbol Meaning
Supported model/feature
🔵 Experimental supported model/feature
Not supported model/feature
🟡 Not tested or verified

Each row tracks a model family. The Example checkpoint is one configuration we have actually run on Metal — a starting point, not the only checkpoint that works. Other sizes and quantizations of the same family generally work too; per-machine details (chip, RAM, macOS, reference match) and the full change history live in the project's PRs, not in this table. If a model or checkpoint does not work, please open an issue rather than adding more rows or example checkpoints.

Text Pooling

Metal V1 has experimental text-only pooling support. See Text Pooling for scope, usage, and validation guidance. The reranker requires Qwen3 sequence-classification hf_overrides.

Model Support Runner Example checkpoint
Qwen3-Embedding 🔵 pooling / embed (paged) mlx-community/Qwen3-Embedding-0.6B-8bit
Qwen3-Reranker 🔵 pooling / classify (paged) mku64/Qwen3-Reranker-0.6B-mlx-8Bit

Multimodal Language Models

Native multimodal support currently targets image-only vision-language requests on the paged backend.

Model Support Runner Scope Example checkpoint
Qwen3-VL 🔵 native multimodal paged generation image input, no video mlx-community/Qwen3-VL-4B-Instruct-4bit
PaddleOCR-VL 🔵 native multimodal paged generation image input, no video PaddlePaddle/PaddleOCR-VL-1.6

Text-Only Language Models

Automatic Prefix Cache is the default behavior when you do not pass --enable-prefix-caching. Since #283, unified paged-KV models reuse shared prefixes by default. Upstream vLLM keeps it off for hybrid/Mamba models, so those rows stay . These values describe default engine behavior, not exhaustive per-model benchmarking on Metal.

HF AWQ checkpoints load through mlx-lm's _transform_awq_weights repack, with an entry-point preflight that normalizes AutoAWQ aliases (w_bit, q_group_size, uppercase "GEMM") and rejects unsupported variants (gemv, bits != 4, group_size != 128, zero_point=false) before model state is built. Verified for Qwen2.5, Llama 3, and Mistral (#340, #381).

Model Support Attention Kernel Automatic Prefix Cache Example checkpoint
Qwen3 GQA (paged) Qwen/Qwen3-0.6B
Qwen3.5 / 3.6 Hybrid SDPA + GDN linear (3.6 adds MoE) Qwen/Qwen3.5-0.8B
Qwen3-Next Hybrid SDPA + GDN linear mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit
Gemma 4 🔵 GQA + per-layer sliding window + YOCO mlx-community/gemma-4-E2B-it
Gemma 3 GQA (paged) mlx-community/gemma-3-1b-it-qat-4bit
Llama 3 GQA (paged) mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
Mistral-7B GQA (paged) mlx-community/Mistral-7B-Instruct-v0.3-4bit
Mistral-Small-24B 🔵 GQA (paged) mlx-community/Mistral-Small-24B-Instruct-2501-4bit
GPT-OSS 🔵 Sink attention (paged) openai/gpt-oss-20b
GLM-4.5 🟡 MLA (paged latent cache, MLX SDPA — no Metal kernel) 🟡
MiniCPM3-4B MLA (paged latent cache) mlx-community/MiniCPM3-4B-4bit
GLM-4.7-Flash 🔵 GQA (paged) mlx-community/GLM-4.7-Flash-4bit
DeepSeek-R1-Distill-Qwen GQA (paged) mlx-community/DeepSeek-R1-Distill-Qwen-7B-3bit
Phi-4-mini GQA packed qkv (paged) microsoft/Phi-4-mini-instruct
Phi-3.5-mini MHA packed qkv (paged) mlx-community/Phi-3.5-mini-instruct-4bit
Qwen2.5 GQA (paged) mlx-community/Qwen2.5-7B-Instruct-4bit
Qwen2-7B GQA (paged) mlx-community/Qwen2-7B-Instruct-4bit
Yi-1.5-9B GQA (paged, LlamaForCausalLM) mlx-community/Yi-1.5-9B-Chat-4bit
SmolLM3-3B GQA (paged) mlx-community/SmolLM3-3B-4bit
Granite 3.3 🔵 GQA (paged) mlx-community/granite-3.3-8b-instruct-4bit