Recommended Model and Feature Matrices

Although vLLM TPU’s new unified backend makes high-performance serving possible out of the box with any model supported in vLLM, we are still implementing a few core components. Until more capabilities land, we recommend starting from the stress-tested models and features listed below.

We are still landing components in tpu-inference that will improve performance for larger, more complex models (XL MoE, vision encoders, MLA, etc.).

If you’d like us to prioritize something specific, please submit a feature request on GitHub.

🚦 Status Legend

✅ Passing: Tested and works as expected. Ready for use.

❌ Failing: Known to be broken or not functional. Help is wanted to fix this!

🧪 Experimental: Works, but unoptimized or pending community validation.

📝 Planned: Not yet implemented, but on the official roadmap.

⛔️ Unplanned: There is no benefit to adding this.

❓ Untested: The functionality exists but has not been recently or thoroughly verified.

Recommended Models

These tables show the models currently tested for accuracy and performance.

Models

Release (first table) | Nightly (second table)

Model	Type	Unit Test	Correctness Test	Performance Test
google/gemma-3-27b-it	Text	✅	✅	✅
meta-llama/Llama-3.1-8B-Instruct	Text	✅	✅	✅
meta-llama/Llama-3.3-70B-Instruct	Text	✅	✅	✅
Qwen/Qwen3-30B-A3B	Text	✅	✅	✅
Qwen/Qwen3-32B	Text	✅	✅	✅
Qwen/Qwen3-4B	Text	✅	✅	✅
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	✅	✅	✅
Qwen/Qwen3.5-397B-A17B	Text	✅	✅	✅
google/gemma-4-26B-A4B-it	Multimodal	✅	✅	❌
google/gemma-4-31B-it	Multimodal	✅	✅	❌
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅	✅	❌
Qwen/Qwen3-Embedding-8B	Embedding	✅	✅	❓
deepseek-ai/DeepSeek-R1	Text	✅	✅	❓
moonshotai/Kimi-K2.6	Text	✅	✅	❓
openai/gpt-oss-120b	Text	✅	✅	❓
google/gemma-4-E2B-it	Multimodal	✅	❌	❓
google/gemma-4-E4B-it	Multimodal	✅	❌	❓
deepseek-ai/DeepSeek-OCR	Multimodal	❓	❓	❓
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3.5-9B	Multimodal	❓	❓	❓
deepseek-ai/DeepSeek-Math-V2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.1	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓	❓	❓
MiniMaxAI/MiniMax-M2.5	Text	❓	❓	❓
moonshotai/Kimi-K2-Thinking	Text	❓	❓	❓
openai/gpt-oss-20b	Text	❓	❓	❓
zai-org/GLM-5	Text	❓	❓	❓

Model	Type	Unit Test	Correctness Test	Performance Test
google/gemma-4-26B-A4B-it	Multimodal	✅	✅	✅
google/gemma-4-31B-it	Multimodal	✅	✅	✅
google/gemma-3-27b-it	Text	✅	✅	✅
meta-llama/Llama-3.1-8B-Instruct	Text	✅	✅	✅
meta-llama/Llama-3.3-70B-Instruct	Text	✅	✅	✅
Qwen/Qwen3-30B-A3B	Text	✅	✅	✅
Qwen/Qwen3-4B	Text	✅	✅	✅
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	✅	✅	✅
Qwen/Qwen3.5-397B-A17B	Text	✅	✅	✅
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅	✅	❌
Qwen/Qwen3-Embedding-8B	Embedding	✅	✅	❓
deepseek-ai/DeepSeek-R1	Text	✅	✅	❓
moonshotai/Kimi-K2.6	Text	✅	✅	❓
openai/gpt-oss-120b	Text	✅	✅	❓
google/gemma-4-E2B-it	Multimodal	✅	❌	❓
google/gemma-4-E4B-it	Multimodal	✅	❌	❓
Qwen/Qwen3-32B	Text	❌	❓	❓
deepseek-ai/DeepSeek-OCR	Multimodal	❓	❓	❓
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓	❓	❓
Qwen/Qwen3.5-9B	Multimodal	❓	❓	❓
deepseek-ai/DeepSeek-Math-V2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.1	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2	Text	❓	❓	❓
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓	❓	❓
MiniMaxAI/MiniMax-M2.5	Text	❓	❓	❓
moonshotai/Kimi-K2-Thinking	Text	❓	❓	❓
openai/gpt-oss-20b	Text	❓	❓	❓
zai-org/GLM-5	Text	❓	❓	❓
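
As a rough illustration of how these tables can be consumed, the sketch below parses tab-separated rows and keeps only the models that pass all three tests. The sample rows are copied from the first Models table above; the `ModelStatus` class and `parse_rows` helper are hypothetical names introduced for this example and are not part of vLLM or tpu-inference.

```python
from dataclasses import dataclass

PASSING = "✅"


@dataclass(frozen=True)
class ModelStatus:
    """One row of a model matrix (hypothetical helper, not a vLLM type)."""
    model: str
    kind: str
    unit: str
    correctness: str
    performance: str

    def fully_passing(self) -> bool:
        # True only when all three test columns show the passing marker.
        return self.unit == self.correctness == self.performance == PASSING


def parse_rows(text: str) -> list[ModelStatus]:
    """Parse tab-separated rows, skipping the header and malformed lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 5 or fields[0] == "Model":
            continue
        rows.append(ModelStatus(*fields))
    return rows


# Sample rows copied from the first Models table.
SAMPLE = (
    "Model\tType\tUnit Test\tCorrectness Test\tPerformance Test\n"
    "meta-llama/Llama-3.1-8B-Instruct\tText\t✅\t✅\t✅\n"
    "Qwen/Qwen2.5-VL-7B-Instruct\tMultimodal\t✅\t✅\t❌\n"
    "deepseek-ai/DeepSeek-R1\tText\t✅\t✅\t❓\n"
)

rows = parse_rows(SAMPLE)
recommended = [row.model for row in rows if row.fully_passing()]
print(recommended)
```

Treating only the all-passing rows as recommended is a conservative choice made for this sketch; a row marked ❓ is untested rather than known to be broken.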

Embedding Models

v7x (first table) | v6e (second table)

Model	Type	Unit Test	Accuracy/Correctness	Benchmark
Qwen/Qwen3-Embedding-8B	Embedding	✅ Passing	✅ Passing	❓ Untested
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3.5-9B	Multimodal	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-OCR	Multimodal	❓ Untested	❓ Untested	❓ Untested
google/gemma-4-26B-A4B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
google/gemma-4-31B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
google/gemma-4-E2B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
google/gemma-4-E4B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
MiniMaxAI/MiniMax-M2.5	Text	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-30B-A3B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-32B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-4B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3.5-397B-A17B	Text	✅ Passing	✅ Passing	✅ Passing
deepseek-ai/DeepSeek-Math-V2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-R1	Text	✅ Passing	✅ Passing	❓ Untested
deepseek-ai/DeepSeek-V3.1	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓ Untested	❓ Untested	❓ Untested
google/gemma-3-27b-it	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.1-8B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.3-70B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
moonshotai/Kimi-K2-Thinking	Text	❓ Untested	❓ Untested	❓ Untested
moonshotai/Kimi-K2.6	Text	✅ Passing	✅ Passing	❓ Untested
openai/gpt-oss-120b	Text	✅ Passing	✅ Passing	❓ Untested
openai/gpt-oss-20b	Text	❓ Untested	❓ Untested	❓ Untested
zai-org/GLM-5	Text	❓ Untested	❓ Untested	❓ Untested

Model	Type	Unit Test	Accuracy/Correctness	Benchmark
Qwen/Qwen3-Embedding-8B	Embedding	✅ Passing	✅ Passing	❓ Untested
Qwen/Qwen2.5-VL-7B-Instruct	Multimodal	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Omni-30B-A3B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-VL-8B-Instruct	Multimodal	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3.5-9B	Multimodal	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-OCR	Multimodal	❓ Untested	❓ Untested	❓ Untested
google/gemma-4-26B-A4B-it	Multimodal	not enough HBM	not enough HBM	not enough HBM
google/gemma-4-31B-it	Multimodal	✅ Passing	✅ Passing	not enough HBM
google/gemma-4-E2B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
google/gemma-4-E4B-it	Multimodal	✅ Passing	✅ Passing	✅ Passing
MiniMaxAI/MiniMax-M2.5	Text	❓ Untested	❓ Untested	❓ Untested
Qwen/Qwen3-30B-A3B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-32B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-4B	Text	✅ Passing	✅ Passing	✅ Passing
Qwen/Qwen3-Coder-480B-A35B-Instruct	Text	not enough HBM	not enough HBM	not enough HBM
Qwen/Qwen3.5-397B-A17B	Text	not enough HBM	not enough HBM	not enough HBM
deepseek-ai/DeepSeek-Math-V2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-R1	Text	not enough HBM	not enough HBM	❓ Untested
deepseek-ai/DeepSeek-V3.1	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2	Text	❓ Untested	❓ Untested	❓ Untested
deepseek-ai/DeepSeek-V3.2-Speciale	Text	❓ Untested	❓ Untested	❓ Untested
google/gemma-3-27b-it	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.1-8B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
meta-llama/Llama-3.3-70B-Instruct	Text	✅ Passing	✅ Passing	✅ Passing
moonshotai/Kimi-K2-Thinking	Text	❓ Untested	❓ Untested	❓ Untested
moonshotai/Kimi-K2.6	Text	not enough HBM	not enough HBM	❓ Untested
openai/gpt-oss-120b	Text	not enough HBM	not enough HBM	❓ Untested
openai/gpt-oss-20b	Text	❓ Untested	❓ Untested	❓ Untested
zai-org/GLM-5	Text	❓ Untested	❓ Untested	❓ Untested
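
The "not enough HBM" entries can be read with a back-of-the-envelope estimate of weight memory: parameter count times bits per weight, divided by 8 to get bytes. The sketch below is only that arithmetic. The per-chip HBM size, chip count, and headroom factor are illustrative assumptions, not the topology used to produce the tables, and the parameter counts are read off the model names.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights only: params * bits / 8 bytes, expressed in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int,
         hbm_per_chip_gb: float, num_chips: int,
         headroom: float = 0.8) -> bool:
    """Check weights against a fraction of total HBM.

    `headroom` is an assumed fraction left for weights after reserving
    space for the KV cache and activations; it is not a measured value.
    """
    budget_gb = hbm_per_chip_gb * num_chips * headroom
    return weight_footprint_gb(params_billion, bits_per_weight) <= budget_gb


# Hypothetical slice used only for illustration.
HBM_PER_CHIP_GB = 32
NUM_CHIPS = 8

for name, params_b in [("Llama-3.1-8B", 8), ("Qwen3-Coder-480B", 480)]:
    size = weight_footprint_gb(params_b, 16)
    ok = fits(params_b, 16, HBM_PER_CHIP_GB, NUM_CHIPS)
    print(f"{name}: {size:.0f} GB of 16-bit weights, fits={ok}")
```

This ignores KV cache growth, activation memory, and sharding overhead, so it gives a lower bound on what a model needs rather than a guarantee that it will load.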

Recommended Features

This table shows the features currently tested for accuracy and performance.

Release (first table) | Nightly (second table)

Feature	Flax	Torchax	Default
async scheduler	✅	✅	✅
Chunked Prefill	✅	✅	✅
DCN-based P/D disaggregation	✅	✅	✅
KV Cache Offload	✅	✅	✅
LoRA_Torch	✅	✅	✅
Multimodal Inputs	✅	✅	✅
Out-of-tree model support	✅	✅	✅
Prefix Caching	✅	✅	✅
Single Program Multi Data	✅	✅	✅
Speculative Decoding: Eagle3	✅	✅	✅
Speculative Decoding: DFlash	✅	✅	✅
Speculative Decoding: Ngram	✅	✅	✅
hybrid kv cache	❓	❓	❓
multi-host	❓	❓	❓
runai_model_streamer_loader	❓	❓	❓
sampling_params	❓	❓	❓
Step Pooling (Embedding)	❓	❓	❓
structured_decoding	❓	❓	❓

Feature	Flax	Torchax	Default
async scheduler	✅	✅	✅
Chunked Prefill	✅	✅	✅
DCN-based P/D disaggregation	✅	✅	✅
KV Cache Offload	✅	✅	✅
LoRA_Torch	✅	✅	✅
Multimodal Inputs	✅	✅	✅
Out-of-tree model support	✅	✅	✅
Prefix Caching	✅	✅	✅
Single Program Multi Data	✅	✅	✅
Speculative Decoding: Eagle3	✅	✅	✅
Speculative Decoding: DFlash	✅	✅	✅
Speculative Decoding: Ngram	✅	✅	✅
hybrid kv cache	❓	❓	❓
multi-host	❓	❓	❓
runai_model_streamer_loader	❓	❓	❓
sampling_params	❓	❓	❓
Step Pooling (Embedding)	❓	❓	❓
structured_decoding	❓	❓	❓

Kernel Support

This table tracks high-level correctness and performance validation for distributed compute kernels.

Feature	Correctness Test	Performance Test
Collective Communication Matmul	✅	❓
MLA	❓	❓
MoE	❓	❓
Quantized Attention	❓	❓
Quantized KV Cache	❓	❓
Quantized Matmul	❓	❓
Ragged Paged Attention V3	✅	✅

Microbenchmark Kernel Support

This section outlines the detailed hardware and precision validation for our core microbenchmark kernels.

Release (first table) | Nightly (second table)

Category	Test	W16A16	W8A8	W8A16	W4A4	W4A8	W4A16
MoE	Fused MoE	❓	❓	❓	❓	❓	❓
MoE	gmm	❓	❓	❓	❓	❓	❓
Dense	All-gather matmul	❓	❓	❓	❓	❓	❓
Attention	Generic Ragged Paged Attention V3	❓	❓	❓	❓	❓	❓
Attention	MLA	❓	❓	❓	❓	❓	❓
Attention	Ragged Paged Attention V3 Head_Dim 64	❓	❓	❓	❓	❓	❓

Note: For attention kernels, W[x]A[y] uses W for the KV cache and A for compute, with x and y giving the bit precision of each.

Category	Test	W16A16	W8A8	W8A16	W4A4	W4A8	W4A16
MoE	Fused MoE	❓	❓	❓	❓	❓	❓
MoE	gmm	❓	❓	❓	❓	❓	❓
Dense	All-gather matmul	❓	❓	❓	❓	❓	❓
Attention	Generic Ragged Paged Attention V3	❓	❓	❓	❓	❓	❓
Attention	MLA	❓	❓	❓	❓	❓	❓
Attention	Ragged Paged Attention V3 Head_Dim 64	❓	❓	❓	❓	❓	❓

Note: For attention kernels, W[x]A[y] uses W for the KV cache and A for compute, with x and y giving the bit precision of each.
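
To make the W[x]A[y] column labels concrete, here is a small sketch that splits a label into its two bit widths. The `Precision` tuple and `parse_precision` function are hypothetical names for this example only. For attention kernels the first number is the KV cache precision, as the note says; for the other kernels it is assumed to follow the usual weight/activation reading.

```python
import re
from typing import NamedTuple


class Precision(NamedTuple):
    """Bit widths encoded in a W[x]A[y] label (hypothetical helper)."""
    weight_bits: int      # KV cache bits for attention kernels
    activation_bits: int  # compute bits


_LABEL = re.compile(r"W(\d+)A(\d+)")


def parse_precision(label: str) -> Precision:
    """Split a label such as 'W8A16' into (8, 16)."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return Precision(int(match.group(1)), int(match.group(2)))


# Column labels taken from the microbenchmark tables.
COLUMNS = ["W16A16", "W8A8", "W8A16", "W4A4", "W4A8", "W4A16"]

for label in COLUMNS:
    p = parse_precision(label)
    print(f"{label}: W={p.weight_bits} bits, A={p.activation_bits} bits")
```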

Parallelism Support

This table shows the current parallelism support status.

Release (first table) | Nightly (second table)

Feature	Flax Single-host	Flax Multi-host	Torchax Single-host	Torchax Multi-host
PP	✅	✅	✅	✅
DP	✅	❓	✅	❓
EP	✅	❓	✅	❓
TP	✅	❓	✅	❓
CP	❓	❓	❓	❓
SP (vote to prioritize)	❓	❓	❓	❓

Feature	Flax Single-host	Flax Multi-host	Torchax Single-host	Torchax Multi-host
PP	✅	✅	✅	✅
DP	✅	❓	✅	❓
EP	✅	❓	✅	❓
TP	✅	❓	❌	❓
CP	❓	❓	❓	❓
SP (vote to prioritize)	❓	❓	❓	❓

Quantization Support

This table shows the current quantization support status.

Release (first table) | Nightly (second table)

Checkpoint dtype	Method	Supported Hardware Acceleration	Flax	Torchax
FP4 W4A16	mxfp4	v7	❓	❓
FP8 W8A16	compressed-tensor	v7	❓	❓
FP8 W8A8	compressed-tensor	v7	❓	❓
INT4 W4A16	awq	v5, v6	❓	❓
INT8 W8A8	compressed-tensor	v5, v6	❓	❓
NVFP4 W4A16	modelopt_fp4	v7	❓	❓

Note: This table reflects checkpoint loading compatibility only.

Checkpoint dtype	Method	Supported Hardware Acceleration	Flax	Torchax
FP4 W4A16	mxfp4	v7	❓	❓
FP8 W8A16	compressed-tensor	v7	❓	❓
FP8 W8A8	compressed-tensor	v7	❓	❓
INT4 W4A16	awq	v5, v6	❓	❓
INT8 W8A8	compressed-tensor	v5, v6	❓	❓
NVFP4 W4A16	modelopt_fp4	v7	❓	❓

Note: This table reflects checkpoint loading compatibility only.
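
One way to use the quantization table is as a lookup from TPU generation to the checkpoint formats listed with hardware acceleration on it. The sketch below transcribes the "Checkpoint dtype", "Method", and "Supported Hardware Acceleration" columns as they appear above; `QUANT_TABLE` and `accelerated_on` are hypothetical names, and the Flax/Torchax status columns are left out because every entry is currently untested.

```python
# (checkpoint dtype, method, TPU generations with hardware acceleration),
# transcribed from the quantization table above.
QUANT_TABLE = [
    ("FP4 W4A16", "mxfp4", ("v7",)),
    ("FP8 W8A16", "compressed-tensor", ("v7",)),
    ("FP8 W8A8", "compressed-tensor", ("v7",)),
    ("INT4 W4A16", "awq", ("v5", "v6")),
    ("INT8 W8A8", "compressed-tensor", ("v5", "v6")),
    ("NVFP4 W4A16", "modelopt_fp4", ("v7",)),
]


def accelerated_on(generation: str) -> list[tuple[str, str]]:
    """Return (dtype, method) pairs listed for the given TPU generation."""
    return [
        (dtype, method)
        for dtype, method, generations in QUANT_TABLE
        if generation in generations
    ]


for gen in ("v5", "v6", "v7"):
    print(gen, accelerated_on(gen))
```

A match here only says the format is listed for that generation; it does not say the checkpoint has been validated, since the table tracks loading compatibility and all statuses are ❓.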