vLLM Gaudi Plugin v0.21.0 Release Notes

Overview

This release is based on vLLM v0.21.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.

Highlights

Removed lazy execution mode from CI, making eager execution the default for CI pipelines while retaining lazy mode support at runtime. (#996)
Introduced a new padding-aware bucketing strategy to improve memory utilization and reduce padding overhead. (#762)
Added W8A8 INT8 quantization with BF16 fallback via HPUCompressedTensorsW8A8Int8_BF16Fallback. (#1168)
Enabled FusedSDPA slicing to improve attention performance. (#1155)
Added an OpenAI-compatible /v1/models/switch entrypoint together with per-model tool-calling and FP8 configs for online model swapping. (#1258)
Added HPU-specific fixes for KV offload and asynchronous speculative decoding. (#1264)
Resolved NIXL connector issues in heterogeneous and homogeneous deployment scenarios. (#1511, #1503)

Performance

Introduced a new padding-aware bucketing strategy. (#762)
Enabled slicing for the FusedSDPA attention path. (#1155)
Skipped HPU graphs for long (query + context) prefills. (#1346)
Fixed guard breaks and improved warmup time for Qwen3 MoE. (#1329)
Improved selective_state_update performance. (#1291)
Removed the function that split MoE decode layer compilation. (#1313)
Optimized the visible block number for hybrid KV cache. (#1317)
Fine-tuned bucketing edge cases for longer contexts. (#1362)
Raised the default max_cudagraph_capture_size floor to 16384. (#1507)
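
As a rough illustration of why the choice of buckets affects padding overhead, the sketch below measures padding waste for a bucket list and builds a list whose worst-case waste stays under a chosen limit. It is not the plugin's algorithm from #762: the helper names, the bucket values, and the waste limit are all assumptions made for this example.

```python
from bisect import bisect_left


def pick_bucket(length, buckets):
    """Return the smallest bucket that fits `length` (buckets must be sorted)."""
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[idx]


def padding_ratio(length, buckets):
    """Fraction of the chosen bucket that is padding rather than real tokens."""
    bucket = pick_bucket(length, buckets)
    return (bucket - length) / bucket


def padding_aware_buckets(min_size, max_size, max_waste):
    """Hypothetical generator that bounds worst-case padding by `max_waste`.

    The worst case for a bucket b is a length of prev + 1, where prev is the
    previous bucket. Requiring (b - (prev + 1)) / b <= max_waste gives
    b <= (prev + 1) / (1 - max_waste).
    """
    if not 0 <= max_waste < 1:
        raise ValueError("max_waste must be in [0, 1)")
    buckets = [min_size]
    while buckets[-1] < max_size:
        prev = buckets[-1]
        nxt = int((prev + 1) / (1 - max_waste))
        nxt = max(nxt, prev + 1)  # always make progress
        buckets.append(min(nxt, max_size))
    return buckets
```

With the doubling list `[128, 256, 512, 1024, 2048]`, a request of length 1025 lands in the 2048 bucket and roughly half of that bucket is padding. A list from `padding_aware_buckets(128, 2048, 0.25)` keeps the padded fraction at or below 25% for every length from 128 to 2048, at the cost of more buckets to warm up.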

Attention & KV Cache

Added prefix caching support for HPUMambaMixer2. (#1366)
Fixed the condition for materialized causal attn_bias. (#1433)
Fixed KV cache slot addressing for hybrid models. (#1327)
Fixed mamba_type comparison for GDN hybrid cache allocation. (#1449)
Fixed extra masking for batched prefill in GDN layers. (#1440)
Fixed HPU-specific bugs affecting KV offload and asynchronous speculative decoding. (#1264)

Quantization

Implemented HPUCompressedTensorsW8A8Int8_BF16Fallback for W8A8 INT8 quantization with BF16 fallback. (#1168)
Fixed Synapse GC compile failure for FP8-quantized models. (#1324)
Enabled Llama4 Maverick FP8 torch.compile support without breaking DeepSeek. (#1396)
Fixed GPT-OSS MxFP4 TP partitioning and quant_method matching. (#1498)
Added Granite-4.0-h calibration config. (#1270)
Fixed load failure of MxFP4 GPT-OSS-120B with expert parallel. (#1411)
Added hf_config parameter to HPU quantization config overrides. (#1349)
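
For readers unfamiliar with the W8A8 term used above, the sketch below works through the arithmetic of 8-bit weights and 8-bit activations on a single dot product. It is a generic illustration only: it assumes symmetric per-tensor scales, it does not model the BF16 fallback, and the function names are invented for this example rather than taken from HPUCompressedTensorsW8A8Int8_BF16Fallback.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: return (int8 codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to the symmetric int8 range [-127, 127].
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product computed on int8 codes, then rescaled to floating point."""
    w_q, w_scale = quantize_int8(weights)
    a_q, a_scale = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(w_q, a_q))  # integer accumulation
    return acc * w_scale * a_scale
```

Calling `w8a8_dot([0.5, -1.0, 0.25], [2.0, 1.0, -4.0])` returns a value close to the exact result of -1.0; the small difference is the quantization error introduced by rounding both operands to 8 bits.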

Plugin Core

Removed lazy execution mode from CI, making eager execution the default in CI pipelines. (#996)
Accepted PEP 440 versions in build detection. (#1351)
Patched torch.accelerator.empty_cache for HPU to fix import-order dependent cleanup failures. (#1430)
Removed matmul_qk output-tensor compatibility gate after 1.24.0. (#1409)
Removed transformers installation from vllm-gaudi. (#1494)
Prevented eager-mode environment variables from leaking into lazy-mode subprocesses. (#1510)
Fixed multiple upstream regressions across MoE, MLA, NIXL, attention, FP8, offloading, and platform modules. (#1279, #1311, #1338, #1342, #1354, #1375, #1377, #1403, #1421, #1428, #1442)
Ported fixes for the MoE fast path, dynamic shapes, kernel block sizes, and batched count operations. (#1453, #1458, #1459, #1460, #1469)
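
To show what accepting PEP 440 versions involves, the sketch below extracts the numeric release segment from version strings that carry pre-release, post-release, development, or local suffixes. It is not the plugin's build detection code from #1351: the pattern is a simplified subset of PEP 440, and the function name and sample version strings are assumptions made for this example.

```python
import re

# Simplified pattern for the PEP 440 public version scheme plus an optional
# local label. It does not cover every normalization the PEP permits.
_PEP440 = re.compile(
    r"^(?:(?P<epoch>\d+)!)?"
    r"(?P<release>\d+(?:\.\d+)*)"
    r"(?:(?P<pre_tag>a|b|rc)(?P<pre_num>\d+))?"
    r"(?:\.post(?P<post>\d+))?"
    r"(?:\.dev(?P<dev>\d+))?"
    r"(?:\+(?P<local>[a-z0-9]+(?:\.[a-z0-9]+)*))?$"
)


def release_tuple(version):
    """Return the numeric release segment as a tuple of integers."""
    match = _PEP440.match(version.strip().lower())
    if match is None:
        raise ValueError(f"not a recognised PEP 440 version: {version!r}")
    return tuple(int(part) for part in match.group("release").split("."))
```

A naive `tuple(int(p) for p in version.split("."))` raises on inputs such as `1.24.0.dev3` or `0.21.0rc1+gaudi.1`, whereas `release_tuple` returns `(1, 24, 0)` and `(0, 21, 0)` for them. In production code, the `packaging.version.Version` class is usually a better choice than a hand-written pattern.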

Serving & Infrastructure

Added torchaudio-free copies of CD Dockerfiles. (#1446)
Set PT_HPU_LAZY_MODE=0 as the default in the Docker auto-calculation. (#1378)
Added OpenAI-compatible /v1/models/switch entrypoint integration. (#1258)
Added per-model tool-calling and FP8 configs. (#1258)
Enhanced process management for the online model swap example. (#1414)
Fixed NIXL connector v1 API signature mismatches for heterogeneous HPU. (#1503)
Fixed heterogeneous and homogeneous NIXL deployment issues for v0.21.0. (#1511)
Enabled defragmentation when contiguous PA is enabled. (#1400)
Clarified VLLM_PROMPT_BS_BUCKET_MAX runtime behavior in the documentation. (#1410)

Fixes

Fixed occasional Qwen3.5 segfaults. (#1500)
Fixed decode bucket generation for hybrid models with mismatched block sizes. (#1486)
Fixed HPU prompt_token_ids device placement for penalty sampling. (#1466)
Fixed decode bucket filter issues. (#1447)
Fixed decode bucketing in non-contiguous PA scenarios. (#1122)
Fixed MRoPE accuracy for Qwen models. (#1437)
Fixed warmup failures and multimodal graph warmup in PT compile-only mode. (#1392)
Flattened 3D inputs_embeds in HpuModelAdapter.forward. (#1381)
Fixed prompt logprobs gathering on Gaudi HPU. (#1405)
Fixed a regression in Mistral-Large-3-675B. (#1304)
Fixed Granite 4H block size calculations. (#1332)
Separated conv1d for Granite 4.0. (#1322)
Corrected get_supported_kernel_block_sizes. (#1384)
Fixed MoE graph breaks from ForwardContext lookups. (#1357)
Reset hybrid block_size to 128 for tool calling. (#1303)
Fixed hybrid model warmup block_size mismatch for Qwen3.5-35B-A3B. (#1462)
Fixed stale gate references overriding caller router_logits in the dp_size==1 MoE fast path. (#1469)
Fixed the Ernie4.5-VL test. (#1105)
Eliminated Llama4 torch.compile recompilations on HPU. (#1360)

Deprecation & Breaking Changes

Removed transformers installation from vllm-gaudi; it is now expected to be provided by the base environment. (#1494)

Full Changelog

PR	Title	Author
#1511	Heterogeneous and Homogeneous fixes for v0.21.0	@hsubramony
#1510	fix: prevent eager-mode env vars leaking to lazy-mode subprocesses	@adobrzyn
#1507	fix: raise default max_cudagraph_capture_size floor to 16384	@adobrzyn
#1503	Fix NIXL connector V1 API signature mismatches for hetero HPU	@hsubramony
#1500	QWN35: fix occasional segfault	@libinta
#1498	Fix gpt-oss mxfp4 TP partitioning and quant_method matching	@mkrze
#1494	Remove transformers installation from vllm-gaudi	@iboiko-habana
#1491	ci: route HF_TOKEN-using jobs through approved-workflow environment	@adobrzyn
#1489	Port of: Update lora tests	@iboiko-habana
#1486	Fix decode bucket generation for hybrid models with mismatched block sizes	@yangulei
#1471	Add pre-merge-approval for execute_pre_merge	@bmyrcha
#1469	Port of: Fix stale gate ref overriding caller router_logits in dp_size==1 MoE fast path	@iboiko-habana
#1466	Fix HPU prompt_token_ids device placement for penalty sampling	@yeonsily
#1462	Port of: fix: hybrid model warmup block_size mismatch (Qwen3.5-35B-A3B)	@iboiko-habana
#1460	Port of: Remove num_ctx_tokens_less_or_equal_batched_max_model_len filter	@iboiko-habana
#1459	Port of: fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash	@iboiko-habana
#1458	Port of: fix: replace batched_count_greater_than to avoid dynamic shape TypeError on HPU	@iboiko-habana
#1453	Port of: fix kernel block size	@iboiko-habana
#1449	Fix mamba_type comparison for GDN hybrid cache allocation	@shepark
#1447	Fix decode bucket filter issues	@yangulei
#1446	Add torchaudio-free copies of CD Dockerfiles	@PatrykWo
#1444	[MiniMax-M2] Remove reduce_results kwarg from FusedMoE init	@mounikamandava
#1443	Harden Qwen3.5 CI test to detect regressions	@shepark
#1442	Fix for MoE refactor	@iboiko-habana
#1440	fix extra masking for batched prefill in GDN layers	@osavchenkox
#1437	MRoPE accuracy fix for Qwen	@hsubramony
#1435	fix: guard CUDA sync in hf3fs mock client for HPU compatibility	@kamil-kaczor
#1433	Fixing condition for materialised causal attn_bias	@ksmusz
#1430	fix: patch torch.accelerator.empty_cache for HPU to fix import-order dependent cleanup failures	@kamil-kaczor
#1428	Fix moe_forward hidden_dim_unpadded parameter	@pawel-olejniczak
#1425	[DOC] Fix torchaudio version	@yangulei
#1424	fix: lower eagle3 spec_decode accuracy threshold to 0.60	@kamil-kaczor
#1422	CI cleanup v0.20.0	@iboiko-habana
#1421	Fix upstream regressions: MoE PluggableLayer recursion, MLA attention init crashes, KV offload module consolidation	@pawel-olejniczak
#1414	Enhance process management for online model swap example	@12010486
#1411	Fix load failure of mxfp4 gpt-oss-120b with expert parallel	@malsbat
#1410	Clarify VLLM_PROMPT_BS_BUCKET_MAX runtime behavior in docs	@iboiko-habana
#1409	Remove matmul_qk output-tensor compatibility gate after 1.24.0	@iboiko-habana
#1405	Fix prompt logprobs gathering on Gaudi HPU	@iboiko-habana
#1403	Fix upstream regressions: MoE refactor, DeepSeek V4 router, KV offload HMA	@pawel-olejniczak
#1400	Enable defrag when contig_pa is enabled	@iboiko-habana
#1399	QA test fixes for ValueError: No common block size for 16	@hsubramony
#1396	fix: enable Llama4 Maverick FP8 torch.compile without breaking DeepSeek	@adobrzyn
#1392	Fix warmup failure and run mm graph warmup pt compile only mode	@shepark
#1385	Update CODEOWNERS	@jbyczkow
#1384	Correct get_supported_kernel_block_sizes	@jbyczkow
#1383	fix: monkey-patch cleanup_dist_env_and_memory for HPU allocator	@kamil-kaczor
#1381	flatten 3D inputs_embeds in HpuModelAdapter.forward	@shepark
#1378	Set Docker auto calc PT_HPU_LAZY_MODE=0 as default	@nngokhale
#1377	Fix upstream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5	@pawel-olejniczak
#1375	Fix offloading test lambdas for upstream req_context parameter	@pawel-olejniczak
#1366	Prefix caching support for HPUMambaMixer2	@jbyczkow
#1364	Logging omitted buckets when bucketing from file is enabled	@ksmusz
#1363	Set GaudiSW version in CI to 1.24.0	@afierka-intel
#1362	Bucketing edge cases finetune for longer ctx	@ksmusz
#1360	Eliminate Llama4 torch.compile recompilations on HPU	@adobrzyn
#1357	Fix MoE graph breaks from ForwardContext lookups	@jbyczkow
#1354	Fix upstream regressions in HPU worker, MoE router, and offloading tests	@pawel-olejniczak
#1352	Add pre-merge-trigger.yaml	@bmyrcha
#1351	Accept PEP 440 versions in build detection	@wjhrdy
#1349	Add hf_config parameter to HPU quantization config overrides	@pawel-olejniczak
#1342	Fix requirements paths and nixl_connector imports after upstream restructuring	@pawel-olejniczak
#1339	Qwen3.5 changes cherry-picked from release 0.19.0	@yeonsily
#1338	Fix upstream regressions in attention, FP8, offloading and platform	@pawel-olejniczak
#1332	Fix granite 4h block size calculations	@jkaniecki
#1329	feat: fix guard breaks and improve warmup time for Qwen3 MoE	@kamil-kaczor
#1327	Fix for proper KV cache slot addressing for Hybrid models	@ksmusz
#1324	Fix Synapse GC compile failure for FP8-quantized models	@slokesha
#1322	Separate conv1d for Granite 4.0	@jbyczkow
#1317	Optimizing visible block number for Hybrid kv_cache	@ksmusz
#1313	Remove splitting moe decode layer compilation func	@shepark
#1311	Fix Pixtral, MoE and Granite regressions	@pawel-olejniczak
#1304	Fix regression in Mistral-Large-3-675B	@skavulya
#1303	reset hybrid block_size to 128 for tool calling	@shepark
#1291	improve selective_state_update	@osavchenkox
#1279	Upstream vLLM compatibility fix	@iboiko-habana
#1270	Granite-4.0-h Calibration config	@mfylcek
#1264	fix: HPU-specific bug fixes for KV-offload + async spec-decode	@hsubramony
#1258	Add per-model toolcalling and fp8 configs; OpenAI /v1/models/switch entrypoint	@12010486
#1242	avoid using pip show to get habana-torch-plugin version	@dtrifiro
#1168	HPUCompressedTensorsW8A8Int8_BF16Fallback impl	@rsmyrek
#1160	Revert "Cap decode block bucket limit to reduce warmup time"	@adobrzyn
#1155	Enable slicing for the FusedSDPA	@yangulei
#1122	Fixes for the decode bucketing in non-contiguous pa scenario	@yangulei
#1105	fix ernie45-vl test	@jinyouzhi
#996	Remove all lazy execution mode from the codebase	@afierka-intel
#762	Add the padding-aware bucketing strategy	@yangulei

New Contributors

Welcome to the following first-time contributors to the vLLM Gaudi Plugin!

@dtrifiro — avoid using pip show to get habana-torch-plugin version (#1242)
@malsbat — Fix load failure of mxfp4 gpt-oss-120b with expert parallel (#1411)
@mkrze — Fix gpt-oss mxfp4 TP partitioning and quant_method matching (#1498)
@mounikamandava — [MiniMax-M2] Remove reduce_results kwarg from FusedMoE init (#1444)
@osavchenkox — fix extra masking for batched prefill in GDN layers (#1440)
@wjhrdy — Accept PEP 440 versions in build detection (#1351)