vLLM Gaudi Plugin v0.21.0 Release Notes
Overview
This release is based on vLLM v0.21.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.
Highlights
- Removed lazy execution mode from CI, making eager execution the default for CI pipelines while retaining lazy mode support at runtime. (#996)
- Introduced a new padding-aware bucketing strategy to improve memory utilization and reduce padding overhead. (#762)
- Added W8A8 INT8 quantization with BF16 fallback via `HPUCompressedTensorsW8A8Int8_BF16Fallback`. (#1168)
- Enabled FusedSDPA slicing to improve attention performance. (#1155)
- Added an OpenAI-compatible `/v1/models/switch` entrypoint together with per-model tool-calling and FP8 configs for online model swapping. (#1258)
- Added HPU-specific fixes for KV offload and asynchronous speculative decoding. (#1264)
- Resolved NIXL connector issues in heterogeneous and homogeneous deployment scenarios. (#1511, #1503)
Performance
- Introduced a new padding-aware bucketing strategy. (#762)
- Enabled slicing for the FusedSDPA attention path. (#1155)
- Skipped HPU graphs for long (query + context) prefills. (#1346)
- Fixed guard breaks and improved warmup time for Qwen3 MoE. (#1329)
- Improved `selective_state_update` performance. (#1291)
- Removed splitting MoE decode layer compilation function. (#1313)
- Optimized the visible block number for hybrid KV cache. (#1317)
- Fine-tuned bucketing edge cases for longer contexts. (#1362)
- Raised the default `max_cudagraph_capture_size` floor to `16384`. (#1507)
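To make the padding cost that bucketing trades against concrete, here is a minimal sketch of how padding overhead can be measured for a given set of buckets. It does not reproduce the strategy from #762; the function names, bucket lists, and sequence lengths are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def pad_to_bucket(length, buckets):
    """Return the smallest bucket >= length (buckets must be sorted ascending)."""
    i = bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def padding_overhead(lengths, buckets):
    """Fraction of all padded positions that hold padding instead of real tokens."""
    padded = sum(pad_to_bucket(n, buckets) for n in lengths)
    real = sum(lengths)
    return (padded - real) / padded


# Hypothetical workload and two hypothetical bucket layouts.
lengths = [100, 130, 260, 500]
coarse = [128, 256, 512, 1024]      # powers of two only
fine = [128, 192, 256, 384, 512]    # extra intermediate buckets

coarse_overhead = padding_overhead(lengths, coarse)
fine_overhead = padding_overhead(lengths, fine)
```

For this toy workload the coarse layout wastes roughly 30% of padded positions and the finer layout roughly 19%, at the price of one more bucket shape to warm up.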
Attention & KV Cache
- Added prefix caching support for HPUMambaMixer2. (#1366)
- Fixed the condition for materialized causal `attn_bias`. (#1433)
- Fixed KV cache slot addressing for hybrid models. (#1327)
- Fixed `mamba_type` comparison for GDN hybrid cache allocation. (#1449)
- Fixed extra masking for batched prefill in GDN layers. (#1440)
- Fixed HPU-specific bugs affecting KV offload and asynchronous speculative decoding. (#1264)
Quantization
- Implemented `HPUCompressedTensorsW8A8Int8_BF16Fallback` for W8A8 INT8 quantization with BF16 fallback. (#1168)
- Fixed Synapse GC compile failure for FP8-quantized models. (#1324)
- Enabled Llama4 Maverick FP8 `torch.compile` support without breaking DeepSeek. (#1396)
- Fixed GPT-OSS MxFP4 TP partitioning and `quant_method` matching. (#1498)
- Added Granite-4.0-h calibration config. (#1270)
- Fixed load failure of MxFP4 GPT-OSS-120B with expert parallel. (#1411)
- Added an `hf_config` parameter to HPU quantization config overrides. (#1349)
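For readers unfamiliar with the term, W8A8 means that both weights and activations are stored as 8-bit integers. The sketch below shows generic symmetric per-tensor INT8 quantization in plain Python. It is not the `HPUCompressedTensorsW8A8Int8_BF16Fallback` implementation, all function names are hypothetical, and because these notes do not describe what the BF16 fallback covers, it is not modeled here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def int8_dot(w_codes, w_scale, a_codes, a_scale):
    """Accumulate in integers, then rescale once by the product of both scales."""
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    return acc * w_scale * a_scale


# Hypothetical weights and activations.
weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

w_codes, w_scale = quantize_int8(weights)
a_codes, a_scale = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(w_codes, w_scale, a_codes, a_scale)
```

The round-trip error of each value is bounded by half the scale, which is why tensors with a few large outliers quantize poorly under a single per-tensor scale.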
Plugin Core
- Removed lazy execution mode from CI, making eager execution the default in CI pipelines. (#996)
- Accepted PEP 440 versions in build detection. (#1351)
- Patched `torch.accelerator.empty_cache` for HPU to fix import-order dependent cleanup failures. (#1430)
- Removed `matmul_qk` output-tensor compatibility gate after 1.24.0. (#1409)
- Removed `transformers` installation from vllm-gaudi. (#1494)
- Prevented eager-mode environment variables from leaking into lazy-mode subprocesses. (#1510)
- Fixed multiple upstream regressions across MoE, MLA, NIXL, attention, FP8, offloading, and platform modules. (#1279, #1311, #1338, #1342, #1354, #1375, #1377, #1403, #1421, #1428, #1442)
- Ported fixes for the MoE fast path, dynamic shapes, kernel block sizes, and batched count operations. (#1453, #1458, #1459, #1460, #1469)
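As background for the PEP 440 item (#1351), the sketch below extracts the numeric release segment from version strings that carry pre-release, dev, or local suffixes. It is a standard-library-only illustration covering a simplified subset of PEP 440 (no epochs or alternate spellings), not the plugin's detection code; these notes do not say which version string is inspected, and the sample strings are hypothetical. In real projects `packaging.version.Version` is the usual tool for this.

```python
import re

# Simplified subset of PEP 440: release segment, optional pre/post/dev parts,
# and an optional local version label after "+".
_PEP440_SUBSET = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)"
    r"(?:(?:a|b|rc)\d+)?"
    r"(?:\.post\d+)?"
    r"(?:\.dev\d+)?"
    r"(?:\+[a-z0-9]+(?:\.[a-z0-9]+)*)?$",
    re.IGNORECASE,
)


def release_tuple(version):
    """Return the release segment as a tuple of ints, e.g. '1.24.0rc1' -> (1, 24, 0)."""
    match = _PEP440_SUBSET.match(version.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in match.group("release").split("."))


def at_least(version, minimum):
    """Compare only the release segment against a minimum tuple."""
    return release_tuple(version) >= tuple(minimum)
```

Comparing only the release segment ignores pre-release ordering, which is acceptable for a coarse "is this at least 1.24" check but not for full PEP 440 ordering.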
Serving & Infrastructure
- Added torchaudio-free copies of CD Dockerfiles. (#1446)
- Set Docker `PT_HPU_LAZY_MODE=0` as the default auto-calculated value. (#1378)
- Added OpenAI-compatible `/v1/models/switch` entrypoint integration. (#1258)
- Added per-model tool-calling and FP8 configs. (#1258)
- Enhanced process management for the online model swap example. (#1414)
- Fixed NIXL connector v1 API signature mismatches for heterogeneous HPU. (#1503)
- Fixed heterogeneous and homogeneous NIXL deployment issues for v0.21.0. (#1511)
- Enabled defragmentation when contiguous PA is enabled. (#1400)
- Clarified `VLLM_PROMPT_BS_BUCKET_MAX` runtime behavior in the documentation. (#1410)
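With `PT_HPU_LAZY_MODE=0` now the Docker default (#1378), a launcher script that starts helper processes may prefer to set the mode explicitly for each child instead of relying on whatever the parent environment holds. The sketch below shows one way to do that with the standard library; the helper names are hypothetical and this is not code from the plugin.

```python
import os
import subprocess
import sys


def child_env(lazy_mode):
    """Copy the current environment and pin PT_HPU_LAZY_MODE for the child."""
    env = dict(os.environ)
    env["PT_HPU_LAZY_MODE"] = "1" if lazy_mode else "0"
    return env


def run_child(code, lazy_mode):
    """Run a Python snippet in a subprocess with an explicit execution mode."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=child_env(lazy_mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child only echoes the variable, so this runs without Gaudi hardware.
seen_by_child = run_child(
    "import os; print(os.environ['PT_HPU_LAZY_MODE'])", lazy_mode=False
)
```

Passing a fresh `env` mapping leaves the parent's own environment untouched, so a value chosen for one child does not carry over to the next.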
Fixes
- Fixed occasional Qwen3.5 segfaults. (#1500)
- Fixed decode bucket generation for hybrid models with mismatched block sizes. (#1486)
- Fixed HPU `prompt_token_ids` device placement for penalty sampling. (#1466)
- Fixed decode bucket filter issues. (#1447)
- Fixed decode bucketing in non-contiguous PA scenarios. (#1122)
- Fixed MRoPE accuracy for Qwen models. (#1437)
- Fixed warmup failures and multimodal graph warmup in PT compile-only mode. (#1392)
- Flattened 3D `inputs_embeds` in `HpuModelAdapter.forward`. (#1381)
- Fixed prompt logprobs gathering on Gaudi HPU. (#1405)
- Fixed a regression in Mistral-Large-3-675B. (#1304)
- Fixed Granite 4H block size calculations. (#1332)
- Separated `conv1d` for Granite 4.0. (#1322)
- Corrected `get_supported_kernel_block_sizes`. (#1384)
- Fixed MoE graph breaks from `ForwardContext` lookups. (#1357)
- Reset hybrid `block_size` to `128` for tool calling. (#1303)
- Fixed hybrid model warmup `block_size` mismatch for Qwen3.5-35B-A3B. (#1462)
- Fixed stale gate references overriding caller `router_logits` in the `dp_size==1` MoE fast path. (#1469)
- Fixed the Ernie4.5-VL test. (#1105)
- Eliminated Llama4 `torch.compile` recompilations on HPU. (#1360)
Deprecation & Breaking Changes
Removed transformers installation from vllm-gaudi; it is now expected to be provided by the base environment. (#1494)
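Because the package is no longer installed by vllm-gaudi, a deployment script may want to fail early when the base environment lacks it. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and no required version is assumed because these notes do not state one.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def require(dist_name):
    """Raise a clear error when a distribution is missing from the environment."""
    version = installed_version(dist_name)
    if version is None:
        raise RuntimeError(
            f"{dist_name} is not installed; provide it in the base environment"
        )
    return version
```

Calling `require("transformers")` at startup turns a late import failure into an immediate, readable error.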
Full Changelog
| PR | Title | Author |
|---|---|---|
| #1511 | Heterogeneous and Homogeneous fixes for v0.21.0 | @hsubramony |
| #1510 | fix: prevent eager-mode env vars leaking to lazy-mode subprocesses | @adobrzyn |
| #1507 | fix: raise default max_cudagraph_capture_size floor to 16384 | @adobrzyn |
| #1503 | Fix NIXL connector V1 API signature mismatches for hetero HPU | @hsubramony |
| #1500 | QWN35: fix occasional segfault | @libinta |
| #1498 | Fix gpt-oss mxfp4 TP partitioning and quant_method matching | @mkrze |
| #1494 | Remove transformers installation from vllm-gaudi | @iboiko-habana |
| #1491 | ci: route HF_TOKEN-using jobs through approved-workflow environment | @adobrzyn |
| #1489 | Port of: Update lora tests | @iboiko-habana |
| #1486 | Fix decode bucket generation for hybrid models with mismatched block sizes | @yangulei |
| #1471 | Add pre-merge-approval for execute_pre_merge | @bmyrcha |
| #1469 | Port of: Fix stale gate ref overriding caller router_logits in dp_size==1 MoE fast path | @iboiko-habana |
| #1466 | Fix HPU prompt_token_ids device placement for penalty sampling | @yeonsily |
| #1462 | Port of: fix: hybrid model warmup block_size mismatch (Qwen3.5-35B-A3B) | @iboiko-habana |
| #1460 | Port of: Remove num_ctx_tokens_less_or_equal_batched_max_model_len filter | @iboiko-habana |
| #1459 | Port of: fix: bypass _forward_impl for dp_size==1 to fix DeepSeek R1 FP8 crash | @iboiko-habana |
| #1458 | Port of: fix: replace batched_count_greater_than to avoid dynamic shape TypeError on HPU | @iboiko-habana |
| #1453 | Port of: fix kernel block size | @iboiko-habana |
| #1449 | Fix mamba_type comparison for GDN hybrid cache allocation | @shepark |
| #1447 | Fix decode bucket filter issues | @yangulei |
| #1446 | Add torchaudio-free copies of CD Dockerfiles | @PatrykWo |
| #1444 | [MiniMax-M2] Remove reduce_results kwarg from FusedMoE init | @mounikamandava |
| #1443 | Harden Qwen3.5 CI test to detect regressions | @shepark |
| #1442 | Fix for MoE refactor | @iboiko-habana |
| #1440 | fix extra masking for batched prefill in GDN layers | @osavchenkox |
| #1437 | MRoPE accuracy fix for Qwen | @hsubramony |
| #1435 | fix: guard CUDA sync in hf3fs mock client for HPU compatibility | @kamil-kaczor |
| #1433 | Fixing condition for materialised causal attn_bias | @ksmusz |
| #1430 | fix: patch torch.accelerator.empty_cache for HPU to fix import-order dependent cleanup failures | @kamil-kaczor |
| #1428 | Fix moe_forward hidden_dim_unpadded parameter | @pawel-olejniczak |
| #1425 | [DOC] Fix torchaudio version | @yangulei |
| #1424 | fix: lower eagle3 spec_decode accuracy threshold to 0.60 | @kamil-kaczor |
| #1422 | CI cleanup v0.20.0 | @iboiko-habana |
| #1421 | Fix upstream regressions: MoE PluggableLayer recursion, MLA attention init crashes, KV offload module consolidation | @pawel-olejniczak |
| #1414 | Enhance process management for online model swap example | @12010486 |
| #1411 | Fix load failure of mxfp4 gpt-oss-120b with expert parallel | @malsbat |
| #1410 | Clarify VLLM_PROMPT_BS_BUCKET_MAX runtime behavior in docs | @iboiko-habana |
| #1409 | Remove matmul_qk output-tensor compatibility gate after 1.24.0 | @iboiko-habana |
| #1405 | Fix prompt logprobs gathering on Gaudi HPU | @iboiko-habana |
| #1403 | Fix upstream regressions: MoE refactor, DeepSeek V4 router, KV offload HMA | @pawel-olejniczak |
| #1400 | Enable defrag when contig_pa is enabled | @iboiko-habana |
| #1399 | QA test fixes for ValueError: No common block size for 16 | @hsubramony |
| #1396 | fix: enable Llama4 Maverick FP8 torch.compile without breaking DeepSeek | @adobrzyn |
| #1392 | Fix warmup failure and run mm graph warmup pt compile only mode | @shepark |
| #1385 | Update CODEOWNERS | @jbyczkow |
| #1384 | Correct get_supported_kernel_block_sizes | @jbyczkow |
| #1383 | fix: monkey-patch cleanup_dist_env_and_memory for HPU allocator | @kamil-kaczor |
| #1381 | flatten 3D inputs_embeds in HpuModelAdapter.forward | @shepark |
| #1378 | Set Docker auto calc PT_HPU_LAZY_MODE=0 as default | @nngokhale |
| #1377 | Fix upstream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5 | @pawel-olejniczak |
| #1375 | Fix offloading test lambdas for upstream req_context parameter | @pawel-olejniczak |
| #1366 | Prefix caching support for HPUMambaMixer2 | @jbyczkow |
| #1364 | Logging omitted buckets when bucketing from file is enabled | @ksmusz |
| #1363 | Set GaudiSW version in CI to 1.24.0 | @afierka-intel |
| #1362 | Bucketing edge cases finetune for longer ctx | @ksmusz |
| #1360 | Eliminate Llama4 torch.compile recompilations on HPU | @adobrzyn |
| #1357 | Fix MoE graph breaks from ForwardContext lookups | @jbyczkow |
| #1354 | Fix upstream regressions in HPU worker, MoE router, and offloading tests | @pawel-olejniczak |
| #1352 | Add pre-merge-trigger.yaml | @bmyrcha |
| #1351 | Accept PEP 440 versions in build detection | @wjhrdy |
| #1349 | Add hf_config parameter to HPU quantization config overrides | @pawel-olejniczak |
| #1342 | Fix requirements paths and nixl_connector imports after upstream restructuring | @pawel-olejniczak |
| #1339 | Qwen3.5 changes cherry-picked from release 0.19.0 | @yeonsily |
| #1338 | Fix upstream regressions in attention, FP8, offloading and platform | @pawel-olejniczak |
| #1332 | Fix granite 4h block size calculations | @jkaniecki |
| #1329 | feat: fix guard breaks and improve warmup time for Qwen3 MoE | @kamil-kaczor |
| #1327 | Fix for proper KV cache slot addressing for Hybrid models | @ksmusz |
| #1324 | Fix Synapse GC compile failure for FP8-quantized models | @slokesha |
| #1322 | Separate conv1d for Granite 4.0 | @jbyczkow |
| #1317 | Optimizing visible block number for Hybrid kv_cache | @ksmusz |
| #1313 | Remove splitting moe decode layer compilation func | @shepark |
| #1311 | Fix Pixtral, MoE and Granite regressions | @pawel-olejniczak |
| #1304 | Fix regression in Mistral-Large-3-675B | @skavulya |
| #1303 | reset hybrid block_size to 128 for tool calling | @shepark |
| #1291 | improve selective_state_update | @osavchenkox |
| #1279 | Upstream vLLM compatibility fix | @iboiko-habana |
| #1270 | Granite-4.0-h Calibration config | @mfylcek |
| #1264 | fix: HPU-specific bug fixes for KV-offload + async spec-decode | @hsubramony |
| #1258 | Add per-model toolcalling and fp8 configs; OpenAI /v1/models/switch entrypoint | @12010486 |
| #1242 | avoid using pip show to get habana-torch-plugin version | @dtrifiro |
| #1168 | HPUCompressedTensorsW8A8Int8_BF16Fallback impl | @rsmyrek |
| #1160 | Revert "Cap decode block bucket limit to reduce warmup time" | @adobrzyn |
| #1155 | Enable slicing for the FusedSDPA | @yangulei |
| #1122 | Fixes for the decode bucketing in non-contiguous pa scenario | @yangulei |
| #1105 | fix ernie45-vl test | @jinyouzhi |
| #996 | Remove all lazy execution mode from the codebase | @afierka-intel |
| #762 | Add the padding-aware bucketing strategy | @yangulei |
New Contributors
Welcome to the following first-time contributors to vLLM Gaudi Plugin!
- @dtrifiro — avoid using pip show to get habana-torch-plugin version (#1242)
- @malsbat — Fix load failure of mxfp4 gpt-oss-120b with expert parallel (#1411)
- @mkrze — Fix gpt-oss mxfp4 TP partitioning and quant_method matching (#1498)
- @mounikamandava — [MiniMax-M2] Remove reduce_results kwarg from FusedMoE init (#1444)
- @osavchenkox — fix extra masking for batched prefill in GDN layers (#1440)
- @wjhrdy — Accept PEP 440 versions in build detection (#1351)