vLLM Gaudi Plugin v0.15.1 Release Notes

Overview

This release is based on vLLM v0.15.1 and supports Intel® Gaudi® Software v1.23.0.


Highlights

  • Added validated support for Granite 4.0-h and Qwen3-VL (dense and MoE variants) on Intel Gaudi 3, along with significant Llama 4 stability fixes.
  • Introduced full chunked prefill attention support for HPU, enabling better memory utilization on long sequences (#821).
  • Integrated FlashAttention online merge in Unified Attention for improved prefill performance (#785).
  • Added KV cache sharing support for HPU, enabling more efficient multi-query scenarios (#834).
  • Introduced support for NVIDIA ModelOpt FP8 quantization format for dense models (#890).
  • Added HPU ops for Mamba mixer2, causal conv1d, and SSD combined kernels enabling hybrid SSM-Transformer models, such as Granite 4.0-h (#886, #897).
  • Added back-to-back matmul operation for improved Multi-Latent Attention (MLA) performance (#770).
  • Introduced prefill-side KV layout and block size support for heterogeneous (disaggregated) inference via NIXL (#867).
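
To illustrate what the causal conv1d op mentioned above computes, here is a minimal pure-Python sketch of a single-channel causal 1-D convolution. It is not the plugin's kernel; the function name `causal_conv1d` and its signature are hypothetical and chosen only for this example. The point is that each output position depends only on the current and earlier inputs, which is what makes the op usable during incremental decoding.

```python
from typing import List


def causal_conv1d(x: List[float], weight: List[float], bias: float = 0.0) -> List[float]:
    """Single-channel causal 1-D convolution (illustrative, not plugin code).

    Output position t depends only on x[t-k+1 .. t]; positions before the
    start of the sequence are treated as zeros (left padding).
    """
    k = len(weight)
    padded = [0.0] * (k - 1) + list(x)
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]  # x[t-k+1 .. t] after left padding
        out.append(bias + sum(w * v for w, v in zip(weight, window)))
    return out


y = causal_conv1d([1.0, 2.0, 3.0, 4.0], weight=[0.5, 0.25, 0.25])
print(y)  # [0.25, 0.75, 1.75, 2.75]
```

Changing a later input never changes an earlier output, which is easy to check by editing the last element of the input and comparing the first three outputs.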

New Model Support

  • Add validated support for Qwen3-VL-32B-Instruct, Qwen3-VL-32B-Thinking, and Qwen3-VL-235B-A22B variants (Instruct, Thinking, FP8) on Gaudi 3 (#958)
  • Register the Qwen3VLMoeForConditionalGeneration model for Qwen3-VL MoE variants (#958)
  • Add IBM Granite 4.0-h small (hybrid SSM-Transformer) implementation for HPU (#897)

Performance

  • Add FlashAttention online merge in Unified Attention for faster prefill (#785)
  • Add back-to-back (b2b) matmul for improved MLA attention performance (#770)
  • Support loading q_scale and using fp8_fused_sdpa for MLA prefill (#909)
  • Remove bucket densification for long context; apply edge buckets only for long context scenarios (#980)
  • Implement bucket corrector for Mamba chunk size (#886)
  • Revert "skip HPU graphs for long prefills" to restore graph capture on long sequences (#850)
  • Port initialization profiling noop to reduce startup overhead (#979)
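
The online merge referenced in the first item can be sketched as follows. This is a generic illustration of the FlashAttention-style technique, not the Unified Attention code from #785: attention is computed separately over two blocks of keys and values, each block keeps its running maximum and softmax denominator, and the partial results are rescaled and combined so that the final output matches attention over all keys at once. All function names are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math
from typing import List, Tuple

Vector = List[float]
State = Tuple[Vector, float, float]  # (unnormalized output, max score, sum of exps)


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> State:
    """Attention of one query over one block, kept unnormalized."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, m, sum(weights)


def merge(a: State, b: State) -> State:
    """Combine two partial states by rescaling both to a shared maximum."""
    out_a, m_a, l_a = a
    out_b, m_b, l_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    out = [scale_a * x + scale_b * y for x, y in zip(out_a, out_b)]
    return out, m, scale_a * l_a + scale_b * l_b


def finalize(state: State) -> Vector:
    """Apply the softmax normalization once, at the end."""
    out, _, l = state
    return [x / l for x in out]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = finalize(partial_attention(q, keys, values))
merged = finalize(merge(partial_attention(q, keys[:2], values[:2]),
                        partial_attention(q, keys[2:], values[2:])))
```

Because normalization is deferred to `finalize`, the merge step never needs the full score matrix, only two scalars per query and block.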

Attention & KV Cache

  • Add support for chunked attention on HPU (#821)
  • Add KV cache sharing for HPU (#834)
  • Enable support for prefill-side kv_layout and block_size update for heterogeneous runs (#867)
  • Add new VLLM_HPU_HETERO_KV_LAYOUT environment variable to control heterogeneous KV layout (#867)
  • Add heterogeneous HPU NIXL connector for disaggregated prefill/decode (#867)
  • Add hpu_attention ops module with attention operation implementations (#785)
  • Monkey-patch Attention.forward for HPU-specific behavior (#973)
  • Declare the support_hybrid_kv_cache platform capability (#834)
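
These notes do not describe how chunked attention is implemented in #821. As a rough sketch of the general idea behind chunked prefill, assuming a fixed chunk size, the prompt is processed in slices and each slice attends causally to everything up to its own end, so a single step scores at most chunk_size × kv_end query-key pairs instead of the full prompt against itself. The helper `prefill_chunks` is hypothetical.

```python
from typing import Iterator, Tuple


def prefill_chunks(prompt_len: int, chunk_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (q_start, q_end, kv_end) for each prefill chunk (illustrative only).

    Queries in [q_start, q_end) attend causally to keys in [0, kv_end).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for q_start in range(0, prompt_len, chunk_size):
        q_end = min(q_start + chunk_size, prompt_len)
        yield q_start, q_end, q_end


chunks = list(prefill_chunks(prompt_len=10, chunk_size=4))
print(chunks)  # [(0, 4, 4), (4, 8, 8), (8, 10, 10)]
```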

Quantization

  • Add support for ModelOpt FP8 quantization format for dense models (#890)
  • Add modelopt to platform supported quantization list (#890)
  • Add dynamic quantization configuration file example (#838)
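
For readers unfamiliar with FP8 weight quantization, the following is a minimal sketch of per-tensor scaling into the E4M3 format. It rests on assumptions that these notes do not confirm: that a single scale per tensor is used, and that the target range is the OCP E4M3 maximum of 448 (hardware-specific ranges may differ). It does not reproduce the ModelOpt checkpoint layout or the plugin's hpu_modelopt op, and all names are hypothetical.

```python
import math
from typing import List, Tuple

E4M3_MAX = 448.0   # assumed: largest finite value in OCP E4M3
E4M3_MIN_EXP = -6  # exponent of the smallest normal value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)), clamped so tiny values use the subnormal spacing
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per power of two
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def quantize_per_tensor(values: List[float]) -> Tuple[List[float], float]:
    """Scale the tensor so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized: List[float], scale: float) -> List[float]:
    return [v * scale for v in quantized]


weights = [0.02, -0.5, 0.13, 0.9]
q_weights, scale = quantize_per_tensor(weights)
restored = dequantize(q_weights, scale)
```

With three mantissa bits, values in the normal range are restored to within about 6.25% relative error, which is why the scale is chosen to place the tensor's largest magnitude at the top of the representable range.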

Plugin Core

  • Register new ops: hpu_attention, hpu_grouped_topk_router, hpu_mamba_mixer2, and hpu_modelopt (#785, #897, #890)
  • Add ops_selector module for HPU operation routing (#897)
  • Add pytorch_implementation module with pure-PyTorch fallback ops (#897)
  • Add causal_conv1d_pytorch and ssd_combined ops for SSM/Mamba support (#897)
  • Add hpu_grouped_topk_router for MoE grouped top-k routing (#897)
  • Source use_qk_norm parameter directly from config (#1035)
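
Grouped top-k routing, as named in the hpu_grouped_topk_router item, can be sketched in a few lines. This is a generic illustration rather than the plugin's op: experts are split into equal groups, the best groups are selected first, and the final top-k experts are chosen only from those groups. Scoring a group by its best expert is an assumption made here; other variants exist. The function name and parameters are hypothetical.

```python
from typing import List


def grouped_topk(scores: List[float], num_groups: int, topk_groups: int, topk: int) -> List[int]:
    """Return topk expert indices restricted to the topk_groups best groups.

    Assumption for this sketch: a group's score is the max score of its experts.
    """
    if len(scores) % num_groups != 0:
        raise ValueError("experts must divide evenly into groups")
    size = len(scores) // num_groups
    group_scores = [max(scores[g * size:(g + 1) * size]) for g in range(num_groups)]
    best_groups = sorted(range(num_groups), key=lambda g: group_scores[g], reverse=True)[:topk_groups]
    candidates = [e for g in best_groups for e in range(g * size, (g + 1) * size)]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:topk]


# 8 experts in 4 groups of 2; groups 0 and 2 have the highest best-expert scores.
router_scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.04, 0.6]
selected = grouped_topk(router_scores, num_groups=4, topk_groups=2, topk=3)
print(selected)  # [1, 4, 0]
```

Note that expert 7 (score 0.6) is not selected even though it outscores expert 0, because its group was not among the chosen groups. That restriction is the difference from plain top-k routing.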

Serving & Infrastructure

  • Add GitHub Actions action.yaml for PR detail workflows (#1030)
  • Add CI calibration smoke tests script (#853)
  • Rename and consolidate CI e2e discoverable tests (#840)
  • Fix Jenkins CI for Mistral model tests (#840)
  • Restore temperature=0 as server default after vLLM #32723 (#1038)
  • Backport RHEL/UBI Dockerfile improvements (#1049)

Fixes

  • Fix the Llama 4 apply-patches flow and QK flatten positional encoding, and address a performance drop (#942)
  • Fix Llama 4 shape mismatch for 32k+ context window (#842, #855)
  • Fix Qwen2.5-VL accuracy regression (#831)
  • Fix Qwen3-VL multimodal model embedding issues (#958)
  • Fix DeepSeek tensor device mismatch (#1029)
  • Force CPU loading for INC quantization to prevent OOM during weight loading (#1005)
  • Fix INC patching _gate twice (#955, #1020)
  • Fix HPU model runner profile_run to work with dynamic kv-cache scales (#852)
  • Fix measurement config file generation in calibrate_model.sh scripts (#853)
  • Revert padding value change for block_list and slot list (#1007)
  • Fix multimodal budget divergence from upstream vLLM (#837)
  • Fix hourly KeyError: <PlatformEnum.OOT: 6> error (#917, #968)
  • Fix torch.compile in data-parallel mode (#722)
  • Correct sliding window enabling logic (#805)
  • Fix interleaved sliding window handling (#805)
  • Fix Mamba cumsum padded calculations (#1022)
  • Fix redundant transpose in HPUMambaMixer2 (#999, #1014)
  • Fix Qwen3-VL MoE execution failure (#992)
  • Fix last_chunk_indices calculations (#1024)

Security

CVE-2025-69872 (diskcache 5.6.3): vLLM currently depends on diskcache 5.6.3, which has been reported as affected by CVE-2025-69872. The vulnerability remains unresolved upstream as of this release. According to initial analysis, the vLLM architecture does not expose the vulnerable code path, so vLLM is not impacted in practice even though the dependency is formally flagged.


Deprecation & Breaking Changes

  • Remove tests/models/utils.py to clean up unused test utilities (#864)
  • VLLM_HPU_HETERO_KV_LAYOUT environment variable is now required for heterogeneous (disaggregated) prefill/decode with NIXL (#867)
  • Remove bucket densification for long context workloads; only edge buckets are applied (#980)
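
Because VLLM_HPU_HETERO_KV_LAYOUT is now required for heterogeneous runs, a launcher script may want to fail early when it is missing. The following is a minimal sketch of such a preflight check. The helper name `hetero_kv_layout` is hypothetical, and the accepted values of the variable are not listed in these notes, so the value is treated as an opaque string and `<layout>` is only a placeholder.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "VLLM_HPU_HETERO_KV_LAYOUT"


def hetero_kv_layout(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured layout, or raise if the variable is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(ENV_NAME, "").strip()
    if not value:
        raise RuntimeError(
            f"{ENV_NAME} must be set for heterogeneous prefill/decode with NIXL"
        )
    return value


# "<layout>" is a placeholder; consult the plugin documentation for valid values.
layout = hetero_kv_layout({ENV_NAME: "<layout>"})
```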

Full Changelog

PR | Title | Author
#805 | Interleaved sliding window fix | @rsmyrek
#722 | DP: Fix for torch.compile | @xuechendi
#770 | Add b2b matmul | @linoybu
#785 | Add FlashAttention online merge in Unified Attention | @kzawora-intel
#805 | Correct sliding window enabling | @jbyczkow
#821 | Add support for chunked attention | @kfojcik-intel
#831 | Resolve qwen25 vl accuracy regression | @tvoas
#834 | KV cache sharing for HPU | @jakub-sochacki
#837 | Fix diverge from vllm in multiModalBudget | @linoybu
#838 | Add dynamic quantization configuration file example | @dudilester
#840 | Jenkins CI fix for Mistral | @iboiko-habana
#850 | Revert "skip HPU graphs for long prefills" | @adobrzyn
#851 | Fix for vLLM #32077 | @iboiko-habana
#852 | Fix HPU model runner profile_run to work with dynamic kv-cache scales | @dudilester
#853 | Fix measurement config file generation in calibrate_model.sh | @nirda7
#864 | Remove unused test utils | @microslaw
#867 | Enable support for prefill side kv_layout and block_size update | @yeonsily
#876 | Refactor for vLLM #30623 and small fix for #32238 | @iboiko-habana
#886 | Implement bucket corrector for Mamba chunk size | @jbyczkow
#890 | Support for modelopt FP8 quantization format for dense models | @skavulya
#897 | HPU Granite 4.0-h small implementation | @jbyczkow
#905 | CODEOWNERS update | @kzawora-intel
#909 | Support loading q_scale and using fp8_fused_sdpa for MLA prefill | @lkk12014402
#917 | Fix for hourly KeyError: PlatformEnum.OOT | @tzielinski-habana
#920 | Update compatibility matrix and refine installation instructions | @PatrykWo
#942 | Llama4 apply patches + QK flatten pos + perf drop fix | @Luca-Calabria
#943 | Update Dockerfiles and documentation for v0.15.1 release | @PatrykWo
#958 | Qwen3_VL - multimodal model embedding fixes | @slokesha
#968 | Fix for hourly KeyError: PlatformEnum.OOT: 6 | @tzielinski-habana
#973 | Monkey-patch Attention.forward | @tzielinski-habana
#979 | Port: Initialization profiling noop | @adobrzyn
#980 | Remove bucket densification for long ctx; Edge buckets only | @kfojcik-intel
#1003 | Remove duplicate path | @adobrzyn
#1005 | Force CPU loading for INC quantization to prevent OOM | @kamil-kaczor
#1007 | Revert padding value change for block_list and slot list | @kamil-kaczor
#1020 | Fix INC patching _gate twice | @kamil-kaczor
#1029 | Fix tensor device mismatch in deepseek | @kamil-kaczor
#1030 | Adding action.yaml | @iboiko-habana
#992 | Fix qwen3 vl moe execution failure | @shepark
#1014 | Fixing redundant transpose in HPUMambaMixer2 | @ksmusz
#1022 | Fix mamba cumsum padded calculations | @jkaniecki
#1024 | last_chunk_indices calculations fix | @jbyczkow
#1035 | use_qk_norm parameter sourced directly from config | @rsmyrek
#1038 | Back temperature=0 for server as default | @iboiko-habana
#1049 | Backport RHEL/UBI Dockerfile improvements | @PatrykWo

New Contributors

Welcome to the following first-time contributors to vLLM Gaudi Plugin! 🎉

  • @linoybu — b2b matmul and multimodal budget fix (#770, #837)
  • @microslaw — Test utilities cleanup (#864)
  • @nirda7 — Calibration script fixes (#853)
  • @tzielinski-habana — Platform stability fixes and Attention.forward monkey-patch (#917, #968, #973)
  • @yeonsily — Heterogeneous KV layout support (#867)
  • @jkaniecki — Mamba cumsum padded calculations fix (#1022)
  • @shepark — Qwen3-VL MoE execution fix (#992)