vLLM Gaudi Plugin v0.17.1 Release Notes

Overview

This release is based on vLLM v0.17.1 and supports Intel® Gaudi® Software v1.23.0 and v1.24.0.


Highlights

  • Added validated support for Ernie4.5-VL, GPT-OSS (20B/120B), and reranking models (BERT-based, RoBERTa-based, and Qwen3-based).
  • Introduced MxFP4 weight loading and dequantization support for Gaudi, enabling GPT-OSS model inference.
  • Introduced major Mamba/Granite 4.0-h improvements, including prefix caching support, custom depthwise conv1d TPC kernels, and precision enhancements.
  • Enhanced RowParallel NIC chunking for better distributed inference performance.
  • Added logprobs output functionality and Granite tool calling accuracy improvements.
  • Improved stability by fixing grammar bitmask corruption.

New Model Support

  • Added support for Ernie4.5-VL. (#813)
  • Ported the following reranking models: BERT-based, RoBERTa-based, and Qwen3-based. (#1001)
  • Added support for GPT-OSS models (lmsys/gpt-oss-20b-bf16, lmsys/gpt-oss-120b-bf16, openai/gpt-oss-20b, and openai/gpt-oss-120b) via MxFP4 weight dequantization. (#1251)
  • Enabled caching for Qwen3 MoE op. (#1249)

Performance

  • Optimized the selective_state_update reference implementation in MambaMixer2 decode. (#1244)
  • Replaced fancy indexing with select and copy for Granite4 state updates. (#1210)
  • Created a custom depthwise conv1d kernel for MambaMixer2. (#1175)
  • Improved bf16 precision of _depthwise_conv1d_tpc. (#1203)
  • Improved hpu_mamba_chunk_scan_combined_varlen. (#997)
  • Improved RowParallel NIC chunking. (#896)
  • Added compute_logits to _compile_methods. (#1081)
  • Blocked B2BMatmul in dynamic quantization. (#1002)
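
For readers unfamiliar with the operation behind the depthwise conv1d entries above, the following is a minimal pure-Python sketch of what a depthwise 1D convolution computes. It is not the TPC kernel from #1175. The function name, the list-based layout, and the causal left padding (the usual choice in Mamba-style mixers) are assumptions made for illustration.

```python
def depthwise_causal_conv1d(x, weight, bias=None):
    """Reference semantics of a depthwise, causal 1D convolution.

    x:      list of channels, each a list of seq_len values
    weight: list of channels, each a list of kernel_size taps
    bias:   optional list with one value per channel

    Depthwise: channel c is filtered only by weight[c].
    Causal (assumed here): output[t] depends only on inputs at positions <= t,
    which is achieved by left-padding each channel with kernel_size - 1 zeros.
    """
    out = []
    for c, (row, kernel) in enumerate(zip(x, weight)):
        k = len(kernel)
        padded = [0.0] * (k - 1) + list(row)
        b = bias[c] if bias is not None else 0.0
        out.append([
            b + sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(row))
        ])
    return out


# Hypothetical two-channel input with a kernel of size 2.
result = depthwise_causal_conv1d(
    x=[[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]],
    weight=[[0.0, 1.0], [1.0, 1.0]],
)
```

Because every channel has its own small kernel and no cross-channel mixing occurs, the operation is cheap in arithmetic but memory-bound, which is one plausible reason to give it a dedicated kernel.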

Attention & KV Cache

  • Transposed state in conv1d instead of changing the KV cache shape. (#1025)
  • Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1215)
  • Added prefix caching support for HPUMambaMixer2. (#1198)

Quantization

  • Loaded and dequantized MxFP4 weights. (#1251)
  • Added a Granite-4.0-h calibration config. (#1221)
  • Forced CPU loading for INC quantization to prevent OOM during weight loading. (#1006)
  • Fixed a type mismatch in DeepSeek with fp8_fused_sdpa for MLA prefill. (#978)
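
As background for the MxFP4 entry above, here is a minimal sketch of MXFP4 dequantization as described by the OCP Microscaling format: 4-bit E2M1 elements in blocks of 32 that share one E8M0 power-of-two scale. It is not the plugin's implementation from #1251. The function names and the nibble order (low nibble first) are assumptions for illustration, and real checkpoints may pack values differently.

```python
# FP4 E2M1 magnitudes indexed by the low three bits of the code
# (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # number of elements that share one scale in MXFP4


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 shared scale: a power of two with exponent bias 127."""
    if scale_byte == 0xFF:
        return float("nan")  # 0xFF is reserved for NaN
    return 2.0 ** (scale_byte - 127)


def dequantize_mxfp4(packed: bytes, scales: bytes) -> list[float]:
    """Dequantize packed FP4 values using one scale byte per 32 elements.

    ASSUMPTION: each byte holds two elements, low nibble first.
    """
    if len(packed) * 2 != len(scales) * BLOCK_SIZE:
        raise ValueError("expected one scale byte per 32 packed elements")
    out = []
    for i, byte in enumerate(packed):
        # Both nibbles of a byte fall into the same block because
        # BLOCK_SIZE is even.
        scale = decode_e8m0(scales[(2 * i) // BLOCK_SIZE])
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out


# One block: every byte packs the codes 0x1 (0.5) and 0x2 (1.0),
# and the shared scale byte 128 decodes to 2.0.
values = dequantize_mxfp4(bytes([0x21] * 16), bytes([128]))
```

A production path would do this with vectorized tensor operations and write the result into bf16 weights; the loop form is only meant to make the format explicit.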

Plugin Core

  • Added the num_spec field to MambaMixer2 for upstream compatibility. (#1142)
  • Fixed a SharedFusedMoE attribute error for Llama4 MoE layers. (#1172)
  • Removed a redundant transpose in HPUMambaMixer2. (#999)
  • Fixed an HPUMambaMixer2 inheritance dependency. (#1017)
  • Fixed Mamba cumsum padded calculations. (#1009)
  • Sourced the use_qk_norm parameter directly from the config. (#972)
  • Replaced MM dummy options. (#1085)
  • Added logprobs output functionality. (#1101)
  • Improved Granite tool-calling accuracy. (#1018)
  • Added the torch inference decorator back to warmup. (#1104)
  • Added a mechanism for adding events to tlparse. (#1054)
  • Fixed the gemma3 unit test by replacing a tuple operation with a torch.compile-friendly equivalent. (#1083)
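
To illustrate how the logprobs output listed above is typically requested, the sketch below builds a request body for an OpenAI-compatible completions endpoint. It assumes the feature is reached through vLLM's OpenAI-compatible server. The helper name, model name, and parameter values are placeholders, and the request is only constructed, not sent.

```python
import json


def build_completion_request(model, prompt, num_logprobs=5):
    """Build a completions request body that asks for token log probabilities.

    `logprobs` is the number of top candidates to return per generated token.
    `temperature` is set explicitly so the request does not depend on the
    server-side default.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 16,
        "temperature": 0,
        "logprobs": num_logprobs,
    }


# "my-model" is a hypothetical model name used only for this example.
body = json.dumps(build_completion_request("my-model", "Hello"))
```

The resulting JSON string would be sent as the POST body to the server's completions route with a `Content-Type: application/json` header.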

Serving & Infrastructure

  • Set docker autocalc rules for reserved memory in Torch compile mode. (#1170)
  • Improved the docker autocalc linear recipe for long contexts. (#959)
  • Fixed the Dockerfile for RHEL 9.6 builds by updating the package installation order. (#1008)
  • Installed torchaudio from the CPU wheel to match the PyTorch version in the Dockerfile. (#1110)
  • Moved the inline Dockerfile to a separate file and added torchaudio. (#1050)
  • Installed torchaudio in CD Dockerfiles. (#1051)
  • Added the PT_VERSION argument and installed torchaudio in the Dockerfile. (#1043)
  • Removed pt_fork and a duplicated package from the UBI image. (#1066)
  • Restored temperature=0 as the server default after upstream vLLM change #32723. (#1039)
  • Fixed setuptools package discovery to include sub-packages. (#1219)
  • Fixed the -u flag requiring an argument in calibrate_model.sh. (#1167)

Fixes

  • Fixed OOM crashes during high-concurrency inference. (#1252)
  • Fixed out-of-host-memory (OOM) errors with Qwen models. (#1256)
  • Fixed grammar bitmask corruption in mixed structured-output batches. (#1199)
  • Fixed Granite4.0h fallback bucket padding. (#1207)
  • Fixed a prefill bucket mismatch when prefills with no context were padded. (#1064)
  • Fixed the default maximum number of decode blocks in the exponential bucketing strategy. (#1091)
  • Fixed an import error for MultiModalBudget. (#1062)
  • Fixed Qwen3-VL warmup. (#994)
  • Prevented server crashes when requests were canceled. (#990)
  • Fixed a parameter mismatch for compute_nixl_compatibility_hash(). (#1224)

Security

  • Fixed SDL secure error handling issues. (#1246)
  • Fixed Coverity findings, including security issues, null-like values, duplicates, and typos. (#1163)

Full Changelog

PR Title Author
#813 Add Support for Ernie4.5-VL @jinyouzhi
#896 RowParallel NIC chunking @kamil-kaczor
#921 Add fp8 calibration tests to CI @afierka-intel
#959 Improve docker autocalc linear recipe for long contexts @nngokhale
#967 Add ci test for granite-4-h-small @microslaw
#972 use_qk_norm parameter sourced directly from config @rsmyrek
#978 Fix type mismatch in DeepSeek with fp8_fused_sdpa for mla prefill @skavulya
#990 Server doesn't crash when request is canceled @tzielinski-habana
#994 Qwen3-VL WarmUp Fix @slokesha
#995 Add sleep mode model swapping test for Gaudi @PatrykWo
#997 hpu_mamba_chunk_scan_combined_varlen improvements @PatrykWilczewski
#998 Hourly fixes – batch no. 2 @pawel-olejniczak
#999 Fixing redundant transpose in HPUMambaMixer2 @ksmusz
#1001 Ported the reranking models: Bert-based, Roberta-based and Qwen3-based @gyou2021
#1002 Blocking B2BMatmul in dynamic quantization @HolyFalafel
#1006 Force CPU loading for INC quantization to prevent OOM during weight loading @agrabow
#1008 Fix Dockerfile for RHEL 9.6 build by updating package installation order @PatrykWo
#1009 Fix mamba cumsum padded calculations @jkaniecki
#1017 Fix HPUMambaMixer2 inheritance dependency @jbyczkow
#1018 Granite accuracy for tool calling @adobrzyn
#1025 Instead of changing kv cache shape, transpose state in conv1d @jmamzax
#1031 Change Qwen3VL to use HPUMMEncoderAttention @jiminha
#1039 Back temperature=0 for server as default after #32723 @iboiko-habana
#1043 Add PT_VERSION argument and install torchaudio in Dockerfile @PatrykWo
#1050 Moved inline Dockerfile to a separate file and added torchaudio @tzielinski-habana
#1051 Install torchaudio in CD Dockerfiles @tzielinski-habana
#1053 Hourly fixes – batch no. 3 @pawel-olejniczak
#1054 Added mechanism for adding events to tlparse @jczaja
#1062 Fix import error for MultiModalBudget @tvoas
#1064 Fix prefill bucket mismatch when prefills with no context are padded @mfylcek
#1066 UBI image: remove pt_fork and duplicated package @ghandoura
#1067 Add workflow to update VLLM_COMMUNITY_COMMIT via GitHub Actions @PatrykWo
#1081 Adding compute_logits to _compile_methods @ksmusz
#1083 Fix to gemma3 UT — replaced tuple operation by TC friendly equivalent @jczaja
#1085 Replace mm dummy options @skaulintel
#1090 Fix for MoE refactor #32344 @iboiko-habana
#1091 Fix for default max decode blocks in exponential @adobrzyn
#1101 Logprobs functionality @adobrzyn
#1104 Add torch inference decorator back to warmup @skaulintel
#1108 Hourly fixes, part 3 @iboiko-habana
#1110 Install torchaudio from CPU wheel to match PyTorch version in Dockerfile @PatrykWo
#1114 Hourly fixes, part 4 @iboiko-habana
#1115 Fix for vLLM #35503 @iboiko-habana
#1116 Fix for vLLM #35503 @iboiko-habana
#1125 Cherry from 0.16.0 release @PatrykWo
#1142 Add num_spec field to MambaMixer2 for upstream compatibility @jbyczkow
#1163 Coverity fix including security, null-like values, duplicates and typos @adobrzyn
#1167 Fix -u flag requiring argument in calibrate_model.sh @adobrzyn
#1170 Set docker auto calc rules for reserved memory in Torch compile mode @nngokhale
#1172 Fix SharedFusedMoE attribute error for Llama4 MoE layers @adobrzyn
#1175 Creating custom depthwise conv1d kernel for MambaMixer2 @ksmusz
#1178 Update quickstart guide and supported model list @PatrykWo
#1198 Prefix caching support for HPUMambaMixer2 @jbyczkow
#1199 Fix grammar bitmask corruption in mixed structured-output batches @jbyczkow
#1203 Improving precision of _depthwise_conv1d_tpc for bf16 @ksmusz
#1207 Granite4.0h fallback bucket padding fix @mfylcek
#1210 Replacing fancy indexing with select and copy for Granite4 state update @ksmusz
#1215 Fix KV cache memory regression from unconditional RowParallelLinear OOT registration @kamil-kaczor
#1219 Fix setuptools package discovery to include sub-packages @app/copilot-swe-agent
#1221 Granite-4.0-h Calibration config @mfylcek
#1224 Fix param mismatch for compute_nixl_compatibility_hash @hsubramony
#1228 Add ci test for granite-4-h-small to 0.17.1 @microslaw
#1244 Optimization of selective_state_update ref in MambaMixer2 decode @ksmusz
#1246 SDL secure error handling fixes @adobrzyn
#1249 Enable caching for qwen3 moe op @shepark
#1251 Load and Dequant MxFP4 Weights @SKRohit
#1252 Fix OOM crashes during high-concurrency inference @afierka-intel
#1256 Fix of Qwen Out of HOST memory (OOM) @iboiko-habana

New Contributors

Welcome to the following first-time contributors to vLLM Gaudi Plugin!

  • @gyou2021 — Ported reranking models: Bert-based, Roberta-based and Qwen3-based (#1001)
  • @jczaja — Added mechanism for adding events to tlparse (#1054)
  • @jinyouzhi — Add Support for Ernie4.5-VL (#813)
  • @mfylcek — Granite4.0h fallback bucket padding fix (#1207)
  • @pawel-olejniczak — Hourly upstream compatibility fixes (#998)
  • @skaulintel — Replace mm dummy options and warmup improvements (#1085)