Recommended Model and Feature Matrices

Although vLLM TPU’s new unified backend makes out-of-the-box high-performance serving possible with any model supported in vLLM, we are still implementing a few core components. Until more capabilities land, we recommend starting from the stress-tested models and features listed below.

We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, +vision encoders, MLA, etc.).

If you’d like us to prioritize something specific, please submit a GitHub feature request here.

🚦 Status Legend
  • ✅ Passing: Tested and works as expected. Ready for use.
  • ❌ Failing: Known to be broken or not functional. Help is wanted to fix this!
  • 🧪 Experimental: Works, but unoptimized or pending community validation.
  • 📝 Planned: Not yet implemented, but on the official roadmap.
  • ⛔️ Unplanned: There is no benefit to adding this.
  • ❓ Untested: The functionality exists but has not been recently or thoroughly verified.
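
If you want to consume these matrices from a script, the legend maps naturally onto an enumeration. The following is a minimal sketch, not part of vLLM or tpu-inference: the names `Status`, `parse_row`, and `fully_passing` are hypothetical, and it assumes each row ends in a fixed number of status cells that begin with one of the legend symbols (cells such as "not enough HBM" are not handled and raise `ValueError`).

```python
from enum import Enum


class Status(Enum):
    """Legend symbols, written as escapes so the file is encoding-safe."""
    PASSING = "\u2705"           # check mark: tested and works
    FAILING = "\u274c"           # cross mark: known broken
    EXPERIMENTAL = "\U0001f9ea"  # test tube: works but unoptimized
    PLANNED = "\U0001f4dd"       # memo: on the roadmap
    UNPLANNED = "\u26d4"         # no entry: no benefit to adding
    UNTESTED = "\u2753"          # question mark: not recently verified


def split_cells(row):
    """Split a row on '|' if present, otherwise on whitespace."""
    if "|" in row:
        return [c.strip() for c in row.strip().strip("|").split("|")]
    return row.split()


def parse_row(row, n_status=3):
    """Return (label, [Status, ...]) for a row with n_status trailing cells."""
    cells = split_cells(row)
    label = " ".join(cells[:-n_status])
    statuses = []
    for cell in cells[-n_status:]:
        # Keep only the leading symbol and drop any emoji variation selector.
        symbol = cell.split()[0].replace("\ufe0f", "")
        statuses.append(Status(symbol))
    return label, statuses


def fully_passing(row, n_status=3):
    """True when every status cell in the row is PASSING."""
    _, statuses = parse_row(row, n_status)
    return all(s is Status.PASSING for s in statuses)
```

A row such as the `meta-llama/Llama-3.1-8B-Instruct` entry, with three passing cells, would satisfy `fully_passing`, while any row containing an untested or failing cell would not.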

These tables show the models currently tested for accuracy and performance.

Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
| --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ | ✅ | ✅ |
| google/gemma-3-27b-it | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3.5-397B-A17B | Text | ✅ | ✅ | ✅ |
| google/gemma-4-26B-A4B-it | Multimodal | ✅ | ✅ | ❌ |
| google/gemma-4-31B-it | Multimodal | ✅ | ✅ | ❌ |
| openai/gpt-oss-120b | Text | ✅ | ✅ | ❓ |
| deepseek-ai/DeepSeek-R1 | Text | ✅ | ❓ | ❓ |
| moonshotai/Kimi-K2.6 | Text | ✅ | ❓ | ❓ |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3.5-9B | Multimodal | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ | ❓ | ❓ |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2-Thinking | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-20b | Text | ❓ | ❓ | ❓ |
| zai-org/GLM-5 | Text | ❓ | ❓ | ❓ |

| Model | Type | Unit Test | Correctness Test | Performance Test |
| --- | --- | --- | --- | --- |
| google/gemma-3-27b-it | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen3.5-397B-A17B | Text | ✅ | ✅ | ✅ |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ | ✅ | ❌ |
| Qwen/Qwen3-Embedding-8B | Embedding | ✅ | ✅ | ❓ |
| deepseek-ai/DeepSeek-R1 | Text | ✅ | ✅ | ❓ |
| openai/gpt-oss-120b | Text | ✅ | ✅ | ❓ |
| google/gemma-4-31B-it | Multimodal | ✅ | ❌ | ❓ |
| google/gemma-4-E2B-it | Multimodal | ✅ | ❌ | ❓ |
| google/gemma-4-E4B-it | Multimodal | ✅ | ❌ | ❓ |
| Qwen/Qwen3-32B | Text | ✅ | ❌ | ❓ |
| moonshotai/Kimi-K2.6 | Text | ✅ | ❓ | ❓ |
| google/gemma-4-26B-A4B-it | Multimodal | ❌ | ❓ | ❓ |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ | ❓ | ❓ |
| Qwen/Qwen3.5-9B | Multimodal | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ | ❓ | ❓ |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ | ❓ | ❓ |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ | ❓ | ❓ |
| moonshotai/Kimi-K2-Thinking | Text | ❓ | ❓ | ❓ |
| openai/gpt-oss-20b | Text | ❓ | ❓ | ❓ |
| zai-org/GLM-5 | Text | ❓ | ❓ | ❓ |

Embedding Models

| Model | Type | Unit Test | Accuracy/Correctness | Benchmark |
| --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3.5-9B | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-4-26B-A4B-it | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| google/gemma-4-31B-it | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-30B-A3B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-32B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-4B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3.5-397B-A17B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-R1 | Text | ✅ Passing | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-3-27b-it | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| moonshotai/Kimi-K2-Thinking | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| moonshotai/Kimi-K2.6 | Text | ✅ Passing | ❓ Untested | ❓ Untested |
| openai/gpt-oss-120b | Text | ✅ Passing | ✅ Passing | ❓ Untested |
| openai/gpt-oss-20b | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| zai-org/GLM-5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |

| Model | Type | Unit Test | Accuracy/Correctness | Benchmark |
| --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3.5-9B | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-4-26B-A4B-it | Multimodal | not enough HBM | not enough HBM | not enough HBM |
| google/gemma-4-31B-it | Multimodal | ✅ Passing | ✅ Passing | not enough HBM |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-30B-A3B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-32B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-4B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | not enough HBM | not enough HBM | not enough HBM |
| Qwen/Qwen3.5-397B-A17B | Text | not enough HBM | not enough HBM | not enough HBM |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-R1 | Text | not enough HBM | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-3-27b-it | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| moonshotai/Kimi-K2-Thinking | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| moonshotai/Kimi-K2.6 | Text | not enough HBM | ❓ Untested | ❓ Untested |
| openai/gpt-oss-120b | Text | not enough HBM | not enough HBM | ❓ Untested |
| openai/gpt-oss-20b | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| zai-org/GLM-5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
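
The "not enough HBM" entries presumably mean the model did not fit in accelerator memory on the tested configuration. As a rough sanity check before trying a model, a weights-only lower bound is parameter count times bits per weight divided by 8. The sketch below illustrates that arithmetic; the function names, the per-chip HBM size, the chip count, and the reserve fraction are all hypothetical placeholders, not values from this page, and the parameter counts are the nominal ones implied by the model names.

```python
def weight_bytes(num_params, bits_per_weight):
    """Weights-only memory footprint in bytes (ignores KV cache and activations)."""
    return num_params * bits_per_weight / 8


def fits(num_params, bits_per_weight, hbm_bytes_per_chip, num_chips,
         reserve_fraction=0.2):
    """Check whether the weights fit after reserving a fraction of HBM.

    reserve_fraction is an assumed allowance for KV cache, activations,
    and runtime overhead; tune it for your workload.
    """
    budget = hbm_bytes_per_chip * num_chips * (1 - reserve_fraction)
    return weight_bytes(num_params, bits_per_weight) <= budget


# Hypothetical slice: 8 chips with 32e9 bytes of HBM each.
HBM_PER_CHIP = 32e9
NUM_CHIPS = 8

# Nominal 70B-parameter model in 16-bit weights: 140e9 bytes.
print(weight_bytes(70e9, 16), fits(70e9, 16, HBM_PER_CHIP, NUM_CHIPS))

# Nominal 480B-parameter model in 16-bit weights: 960e9 bytes.
print(weight_bytes(480e9, 16), fits(480e9, 16, HBM_PER_CHIP, NUM_CHIPS))
```

This estimate is only a lower bound: a model that passes this check can still run out of memory once the KV cache and batch size are taken into account.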

This table shows the features currently tested for accuracy and performance.

| Feature | Flax | Torchax | Default |
| --- | --- | --- | --- |
| async scheduler | ✅ | ✅ | ✅ |
| Chunked Prefill | ✅ | ✅ | ✅ |
| DCN-based P/D disaggregation | ✅ | ✅ | ✅ |
| LoRA_Torch | ✅ | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ | ✅ |
| KV Cache Offload | ✅ | ❌ | ✅ |
| Multimodal Inputs | ✅ | ❌ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ❌ | ✅ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| multi-host | ❓ | ❓ | ❓ |
| runai_model_streamer_loader | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| Single-Host-P-D-disaggregation | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |

| Feature | Flax | Torchax | Default |
| --- | --- | --- | --- |
| async scheduler | ✅ | ✅ | ✅ |
| Chunked Prefill | ✅ | ✅ | ✅ |
| KV Cache Offload | ✅ | ✅ | ✅ |
| LoRA_Torch | ✅ | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ | ✅ |
| DCN-based P/D disaggregation | ✅ | ❌ | ✅ |
| Multimodal Inputs | ✅ | ❌ | ✅ |
| Single-Host-P-D-disaggregation | ❌ | ❌ | ❌ |
| runai_model_streamer_loader | ❓ | ❌ | ❓ |
| Step Pooling (Embedding) | ❓ | ❓ | ❌ |
| hybrid kv cache | ❓ | ❓ | ❓ |
| multi-host | ❓ | ❓ | ❓ |
| sampling_params | ❓ | ❓ | ❓ |
| structured_decoding | ❓ | ❓ | ❓ |

Kernel Support

This table tracks high-level correctness and performance validation for distributed compute kernels.

| Feature | Correctness Test | Performance Test |
| --- | --- | --- |
| Collective Communication Matmul | ✅ | ❓ |
| MLA | ❓ | ❓ |
| MoE | ❓ | ❓ |
| Quantized Attention | ❓ | ❓ |
| Quantized KV Cache | ❓ | ❓ |
| Quantized Matmul | ❓ | ❓ |
| Ragged Paged Attention V3 | ✅ | ✅ |

Microbenchmark Kernel Support

This section outlines the detailed hardware and precision validation for our core microbenchmark kernels.

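| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoE | Fused MoE | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| MoE | gmm | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Dense | All-gather matmul | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Generic Ragged Paged Attention V3 | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | MLA | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Attention | Ragged Paged Attention V3 Head_Dim 64 | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |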

Note: For attention kernels, W[x]A[y] denotes the KV cache precision (W) and the compute precision (A), where x and y are the bit widths.
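
To make the W[x]A[y] notation concrete, here is a small sketch that splits a column label into its two bit widths and describes it. The helper names `parse_precision` and `describe` are hypothetical. The reading of W as weights and A as activations for non-attention kernels is the common quantization convention and is an assumption here, since the note above only defines the attention case.

```python
import re
from typing import NamedTuple


class Precision(NamedTuple):
    w_bits: int  # x in W[x]A[y]
    a_bits: int  # y in W[x]A[y]


_PATTERN = re.compile(r"^W(\d+)A(\d+)$")


def parse_precision(tag: str) -> Precision:
    """Parse a label such as 'W8A16' into its two bit widths."""
    match = _PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {tag!r}")
    return Precision(int(match.group(1)), int(match.group(2)))


def describe(tag: str, attention: bool = False) -> str:
    """Spell out a label; attention kernels use the KV cache/compute reading."""
    p = parse_precision(tag)
    w_name = "KV cache" if attention else "weights"
    a_name = "compute" if attention else "activations"
    return f"{p.w_bits}-bit {w_name}, {p.a_bits}-bit {a_name}"


print(describe("W8A16"))                  # 8-bit weights, 16-bit activations
print(describe("W4A8", attention=True))   # 4-bit KV cache, 8-bit compute
```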

Parallelism Support

This table shows the current parallelism support status.

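| Feature | Flax (single-host) | Flax (multi-host) | Torchax (single-host) | Torchax (multi-host) |
| --- | --- | --- | --- | --- |
| PP | ✅ | ✅ | ✅ | ✅ |
| DP | ✅ | ❓ | ✅ | ❓ |
| EP | ✅ | ❓ | ✅ | ❓ |
| TP | ✅ | ❓ | ✅ | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
| SP (vote to prioritize) | ❓ | ❓ | ❓ | ❓ |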

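| Feature | Flax (single-host) | Flax (multi-host) | Torchax (single-host) | Torchax (multi-host) |
| --- | --- | --- | --- | --- |
| PP | ✅ | ✅ | ✅ | ✅ |
| EP | ✅ | ❓ | ✅ | ❓ |
| TP | ✅ | ❓ | ✅ | ❓ |
| DP | ❌ | ❓ | ✅ | ❓ |
| CP | ❓ | ❓ | ❓ | ❓ |
| SP (vote to prioritize) | ❓ | ❓ | ❓ | ❓ |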

Quantization Support

This table shows the current quantization support status.

| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
| --- | --- | --- | --- | --- |
| FP4 W4A16 | mxfp4 | v7 | ❓ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❓ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❓ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❓ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❓ | ❓ |

Note: This table reflects checkpoint loading compatibility only.

| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
| --- | --- | --- | --- | --- |
| FP4 W4A16 | mxfp4 | v7 | ❓ | ❓ |
| FP8 W8A16 | compressed-tensor | v7 | ❓ | ❓ |
| FP8 W8A8 | compressed-tensor | v7 | ❓ | ❓ |
| INT4 W4A16 | awq | v5, v6 | ❓ | ❓ |
| INT8 W8A8 | compressed-tensor | v5, v6 | ❓ | ❓ |
| NVFP4 W4A16 | modelopt_fp4 | v7 | ❓ | ❓ |

Note: This table reflects checkpoint loading compatibility only.