Recommended Model and Feature Matrices
Although vLLM TPU's new unified backend makes high-performance serving possible out of the box with any model supported in vLLM, we are still implementing a few core components. Until more capabilities land, we recommend starting from the stress-tested models and features listed below.
We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).
If you'd like us to prioritize something specific, please submit a GitHub feature request here.
🚦 Status Legend
- ✅ Passing: Tested and works as expected. Ready for use.
- ❌ Failing: Known to be broken or not functional. Help is wanted to fix this!
- 🧪 Experimental: Works, but unoptimized or pending community validation.
- 📝 Planned: Not yet implemented, but on the official roadmap.
- ⛔️ Unplanned: There is no benefit to adding this.
- ❓ Untested: The functionality exists but has not been recently or thoroughly verified.
Recommended Models
These tables show the models currently tested for accuracy and performance.
Models
Embedding Models
| Model | Type | UnitTest | Accuracy/Correctness | Benchmark |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3.5-9B | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-4-26B-A4B-it | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| google/gemma-4-31B-it | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-30B-A3B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-32B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-4B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3.5-397B-A17B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-R1 | Text | ✅ Passing | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-3-27b-it | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| moonshotai/Kimi-K2-Thinking | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| moonshotai/Kimi-K2.6 | Text | ✅ Passing | ❓ Untested | ❓ Untested |
| openai/gpt-oss-120b | Text | ✅ Passing | ✅ Passing | ❓ Untested |
| openai/gpt-oss-20b | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| zai-org/GLM-5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Model | Type | UnitTest | Accuracy/Correctness | Benchmark |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3.5-9B | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-OCR | Multimodal | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-4-26B-A4B-it | Multimodal | not enough HBM | not enough HBM | not enough HBM |
| google/gemma-4-31B-it | Multimodal | ✅ Passing | ✅ Passing | not enough HBM |
| MiniMaxAI/MiniMax-M2.5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| Qwen/Qwen3-30B-A3B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-32B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-4B | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | not enough HBM | not enough HBM | not enough HBM |
| Qwen/Qwen3.5-397B-A17B | Text | not enough HBM | not enough HBM | not enough HBM |
| deepseek-ai/DeepSeek-Math-V2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-R1 | Text | not enough HBM | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.1 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| google/gemma-3-27b-it | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.1-8B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| meta-llama/Llama-3.3-70B-Instruct | Text | ✅ Passing | ✅ Passing | ✅ Passing |
| moonshotai/Kimi-K2-Thinking | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| moonshotai/Kimi-K2.6 | Text | not enough HBM | ❓ Untested | ❓ Untested |
| openai/gpt-oss-120b | Text | not enough HBM | not enough HBM | ❓ Untested |
| openai/gpt-oss-20b | Text | ❓ Untested | ❓ Untested | ❓ Untested |
| zai-org/GLM-5 | Text | ❓ Untested | ❓ Untested | ❓ Untested |
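Since the legend marks only "Passing" as ready for use, one way to shortlist models is to keep those whose three status columns all pass. The following is a minimal sketch of that filter, not part of tpu-inference: `ModelRow`, `ROWS`, and `fully_validated` are hypothetical names, and the sample rows are copied by hand from the table directly above.

```python
from typing import NamedTuple


class ModelRow(NamedTuple):
    """One row of the model table (hypothetical helper type)."""
    model: str
    kind: str
    unit_test: str
    accuracy: str
    benchmark: str


# A few rows copied by hand from the table above; extend as needed.
ROWS = [
    ModelRow("Qwen/Qwen3-4B", "Text", "Passing", "Passing", "Passing"),
    ModelRow("google/gemma-4-31B-it", "Multimodal",
             "Passing", "Passing", "not enough HBM"),
    ModelRow("meta-llama/Llama-3.1-8B-Instruct", "Text",
             "Passing", "Passing", "Passing"),
    ModelRow("openai/gpt-oss-20b", "Text",
             "Untested", "Untested", "Untested"),
]


def fully_validated(rows):
    """Return model names whose three status columns are all 'Passing'."""
    return [
        r.model
        for r in rows
        if {r.unit_test, r.accuracy, r.benchmark} == {"Passing"}
    ]


print(fully_validated(ROWS))
```

Any other status, including "Untested" and "not enough HBM", excludes the model from the shortlist, which is the conservative reading of the legend.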
Recommended Features
This table shows the features currently tested for accuracy and performance.
| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | β | β | β |
| Chunked Prefill | β | β | β |
| DCN-based P/D disaggregation | β | β | β |
| LoRA_Torch | β | β | β |
| Out-of-tree model support | β | β | β |
| Prefix Caching | β | β | β |
| Single Program Multi Data | β | β | β |
| Speculative Decoding: Ngram | β | β | β |
| KV Cache Offload | β | β | β |
| Multimodal Inputs | β | β | β |
| Speculative Decoding: Eagle3 | β | β | β |
| hybrid kv cache | β | β | β |
| multi-host | β | β | β |
| runai_model_streamer_loader | β | β | β |
| sampling_params | β | β | β |
| Single-Host-P-D-disaggregation | β | β | β |
| structured_decoding | β | β | β |
| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | β | β | β |
| Chunked Prefill | β | β | β |
| KV Cache Offload | β | β | β |
| LoRA_Torch | β | β | β |
| Out-of-tree model support | β | β | β |
| Prefix Caching | β | β | β |
| Single Program Multi Data | β | β | β |
| Speculative Decoding: Eagle3 | β | β | β |
| Speculative Decoding: Ngram | β | β | β |
| DCN-based P/D disaggregation | β | β | β |
| Multimodal Inputs | β | β | β |
| Single-Host-P-D-disaggregation | β | β | β |
| runai_model_streamer_loader | β | β | β |
| Step Pooling (Embedding) | β | β | β |
| hybrid kv cache | β | β | β |
| multi-host | β | β | β |
| sampling_params | β | β | β |
| structured_decoding | β | β | β |
Kernel Support
This table tracks high-level correctness and performance validation for distributed compute kernels.
| Feature | CorrectnessTest | PerformanceTest |
|---|---|---|
| Collective Communication Matmul | β | β |
| MLA | β | β |
| MoE | β | β |
| Quantized Attention | β | β |
| Quantized KV Cache | β | β |
| Quantized Matmul | β | β |
| Ragged Paged Attention V3 | β | β |
Microbenchmark Kernel Support
This section outlines the detailed hardware and precision validation for our core microbenchmark kernels.
| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | β | β | β | β | β | β |
| MoE | gmm | β | β | β | β | β | β |
| Dense | All-gather matmul | β | β | β | β | β | β |
| Attention | Generic Ragged Paged Attention V3 | β | β | β | β | β | β |
| Attention | MLA | β | β | β | β | β | β |
| Attention | Ragged Paged Attention V3 Head_Dim 64 | β | β | β | β | β | β |
Note: For attention kernels, W[x]A[y] denotes the KV cache precision (W) and the compute precision (A), where x and y are the bit widths.
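To make the W[x]A[y] column labels concrete, here is a small sketch that splits a label into its two bit widths. `Precision` and `parse_precision` are hypothetical names used only for illustration and do not come from tpu-inference.

```python
import re
from typing import NamedTuple


class Precision(NamedTuple):
    """Bit widths encoded in a W[x]A[y] label (hypothetical helper type)."""
    w_bits: int  # for attention kernels: KV cache precision
    a_bits: int  # for attention kernels: compute precision


_LABEL = re.compile(r"W(\d+)A(\d+)")


def parse_precision(label: str) -> Precision:
    """Parse a column label such as 'W8A16' into (w_bits, a_bits)."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a W[x]A[y] label: {label!r}")
    return Precision(int(match.group(1)), int(match.group(2)))


# The six precision columns used in the microbenchmark table.
for label in ("W16A16", "W8A8", "W8A16", "W4A4", "W4A8", "W4A16"):
    p = parse_precision(label)
    print(f"{label}: KV cache {p.w_bits}-bit, compute {p.a_bits}-bit")
```

The printed wording follows the attention-kernel reading given in the note.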
| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | β | β | β | β | β | β |
| MoE | gmm | β | β | β | β | β | β |
| Dense | All-gather matmul | β | β | β | β | β | β |
| Attention | Generic Ragged Paged Attention V3 | β | β | β | β | β | β |
| Attention | MLA | β | β | β | β | β | β |
| Attention | Ragged Paged Attention V3 Head_Dim 64 | β | β | β | β | β | β |
Note: For attention kernels, W[x]A[y] denotes the KV cache precision (W) and the compute precision (A), where x and y are the bit widths.
Parallelism Support
This table shows the current parallelism support status.
| Feature | Flax Single-host | Flax Multi-host | Torchax Single-host | Torchax Multi-host |
|---|---|---|---|---|
| PP | β | β | β | β |
| DP | β | β | β | β |
| EP | β | β | β | β |
| TP | β | β | β | β |
| CP | β | β | β | β |
| SP (vote to prioritize) | β | β | β | β |
| Feature | Flax Single-host | Flax Multi-host | Torchax Single-host | Torchax Multi-host |
|---|---|---|---|---|
| PP | β | β | β | β |
| EP | β | β | β | β |
| TP | β | β | β | β |
| DP | β | β | β | β |
| CP | β | β | β | β |
| SP (vote to prioritize) | β | β | β | β |
Quantization Support
This table shows the current quantization support status.
| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | β | β |
| FP8 W8A16 | compressed-tensor | v7 | β | β |
| FP8 W8A8 | compressed-tensor | v7 | β | β |
| INT4 W4A16 | awq | v5, v6 | β | β |
| INT8 W8A8 | compressed-tensor | v5, v6 | β | β |
Note: This table reflects checkpoint loading compatibility only.
| Checkpoint dtype | Method | Supported Hardware Acceleration | Flax | Torchax |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | β | β |
| FP8 W8A16 | compressed-tensor | v7 | β | β |
| FP8 W8A8 | compressed-tensor | v7 | β | β |
| INT4 W4A16 | awq | v5, v6 | β | β |
| INT8 W8A8 | compressed-tensor | v5, v6 | β | β |
| NVFP4 W4A16 | modelopt_fp4 | v7 | β | β |
Note: This table reflects checkpoint loading compatibility only.
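As a quick way to read the "Supported Hardware Acceleration" column, the sketch below looks up which checkpoint dtypes are listed for a given TPU generation. `QUANT_TABLE` and `dtypes_for` are hypothetical names. The rows are transcribed from the table directly above, and the lookup deliberately ignores the Flax and Torchax status columns, so a hit means only that the table lists hardware acceleration for that generation.

```python
# (checkpoint dtype, method, TPU generations) transcribed from the table above.
QUANT_TABLE = [
    ("FP4 W4A16", "mxfp4", {"v7"}),
    ("FP8 W8A16", "compressed-tensor", {"v7"}),
    ("FP8 W8A8", "compressed-tensor", {"v7"}),
    ("INT4 W4A16", "awq", {"v5", "v6"}),
    ("INT8 W8A8", "compressed-tensor", {"v5", "v6"}),
    ("NVFP4 W4A16", "modelopt_fp4", {"v7"}),
]


def dtypes_for(generation: str) -> list[str]:
    """Return the checkpoint dtypes the table lists for one TPU generation."""
    return sorted(
        dtype
        for dtype, _method, generations in QUANT_TABLE
        if generation in generations
    )


print(dtypes_for("v6"))
print(dtypes_for("v7"))
```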