vLLM Gaudi Plugin v0.19.0 Release Notes

Overview

This release is based on vLLM v0.19.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.


Highlights

  • Upgraded platform compatibility to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10.
  • Introduced Qwen 3.5 model support with compact mode for improved memory utilization.
  • Introduced Mamba prefix caching support for hybrid SSM-Transformer models on v0.19.0.
  • Added MxFP4 weight loading and dequantization for next-generation quantization formats.
  • Integrated LMCache support via monkey-patching for external cache backends.
  • Introduced custom depthwise conv1d TPC kernel for MambaMixer2 to improve hybrid model performance.
  • Adapted the online defragmenter for torch.compile mode, enabling memory defragmentation in compiled execution.
  • Improved warmup performance by capping decode block bucket limits.

New Model Support

  • Added initial Qwen 3.5 model support. (#1153)
  • Added Qwen 3.5 compact mode. (#1235)
  • Added follow-up Qwen 3.5 changes and fixes. (#1312)

Performance

  • Created a custom depthwise conv1d kernel for MambaMixer2. (#1092)
  • Adapted the online defragmenter for torch.compile. (#986)
  • Set reserved memory for torch.compile. (#1093)
  • Capped the decode block bucket limit to reduce warmup time. (#1160)
  • Optimized selective_state_update. (#1295)
  • Optimized the visible block number for hybrid KV cache. (#1319)
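For readers unfamiliar with the operation behind the custom MambaMixer2 kernel (#1092), the following is a minimal pure-Python reference of a depthwise conv1d. It is a sketch only: it assumes the causal (left-padded) variant commonly used in Mamba-style mixers, the function name and data layout are hypothetical, and it does not reflect how the TPC kernel is implemented.

```python
from typing import List, Optional


def depthwise_causal_conv1d(
    x: List[List[float]],
    weight: List[List[float]],
    bias: Optional[List[float]] = None,
) -> List[List[float]]:
    """Reference depthwise conv1d (illustrative, not the plugin's kernel).

    x:      [channels][seq_len] input
    weight: [channels][kernel_size], one filter per channel (depthwise)
    bias:   optional [channels]
    Assumption: causal padding, so output[t] only sees inputs at positions <= t.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same number of channels")
    out = []
    for c, (row, taps) in enumerate(zip(x, weight)):
        k = len(taps)
        # Left-pad with k-1 zeros so the output keeps the input length.
        padded = [0.0] * (k - 1) + list(row)
        b = bias[c] if bias is not None else 0.0
        out.append(
            [b + sum(taps[j] * padded[t + j] for j in range(k)) for t in range(len(row))]
        )
    return out
```

Each channel is filtered independently with its own taps, which is what distinguishes a depthwise convolution from a dense one and keeps the cost linear in the number of channels.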

Attention & KV Cache

  • Added Mamba prefix caching support for v0.19.0. (#1330)
  • Fixed KV cache slot addressing for hybrid models. (#1323)
  • Resolved KV cache access in HPUMambaMixer2 and reintroduced Granite4.0 in CI. (#1287)
  • Removed dead Unified Attention (UA) code. (#1226)
  • Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1146)
  • Excluded dummy block from NIXL KV cache registration. (#1140)

Quantization

  • Added loading and dequantization of MxFP4 weights. (#1156)
  • Fixed INC FP8 dynamic quantization for MoE models on HPU. (#1183)
  • Fixed FP8 block-to-channel conversion breaking MLA weight loading. (#1220)
  • Fixed INC/MLA alias-path quantization failures. (#1222)
  • Added Granite-4.0-h calibration config. (#1221)
  • Fixed Synapse GC compile failure for FP8-quantized models. (#1334)
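As background for the MxFP4 item above, here is a hedged sketch of what dequantizing MXFP4 data involves, following the OCP Microscaling format: 4-bit E2M1 elements that share one 8-bit power-of-two (E8M0) scale per block of 32. The function names are hypothetical, and the nibble order (low nibble first) is an assumption about the packing; the plugin's actual loader may differ.

```python
from typing import List

# Magnitudes representable by a 3-bit E2M1 code (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements sharing one scale in the MX format


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit element: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def dequantize_mxfp4(packed: bytes, scales: bytes) -> List[float]:
    """Expand packed 4-bit elements into floats (illustrative sketch).

    packed: two elements per byte; ASSUMPTION: low nibble holds the first one.
    scales: one E8M0 byte per block, value = 2 ** (byte - 127); 0xFF encodes NaN.
    """
    bytes_per_block = BLOCK_SIZE // 2
    if len(packed) != len(scales) * bytes_per_block:
        raise ValueError("packed data does not match the number of scale blocks")
    out: List[float] = []
    for block_idx, exponent in enumerate(scales):
        scale = float("nan") if exponent == 0xFF else 2.0 ** (exponent - 127)
        start = block_idx * bytes_per_block
        for byte in packed[start : start + bytes_per_block]:
            out.append(decode_e2m1(byte & 0x0F) * scale)
            out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

Because the scale is a pure power of two, applying it only shifts the exponent of each element, so dequantization itself introduces no additional rounding error.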

Plugin Core

  • Added a patch for LMCache. (#1176)
  • Removed aggregate module HpuDeepseekOCRVisual. (#1102)
  • Removed deprecated virtual_engine from ForwardContext. (#1187)
  • Fixed CPUOffloadingSpec import path and removed obsolete roberta patch. (#1229)
  • Separated conv1d for Granite 4.0 (v0.17.1-style). (#1320)
  • Added the num_spec field to MambaMixer2 for upstream compatibility. (#1141)
  • Fixed setuptools package discovery to include all sub-packages. (#1212)
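The LMCache integration (#1176) is delivered as a monkey patch. The snippet below is a generic illustration of that technique only: `ExternalCacheConnector`, `device_name`, and `apply_patch` are hypothetical stand-ins, not LMCache or plugin APIs, and the real patch may target entirely different attributes.

```python
import functools


class ExternalCacheConnector:
    """Hypothetical third-party class standing in for code we cannot edit."""

    def device_name(self) -> str:
        return "cuda"


def apply_patch(target_cls, method_name, replacement):
    """Replace target_cls.method_name at runtime; safe to call more than once.

    `replacement` receives the original method as its first argument so it
    can delegate to it when needed.
    """
    original = getattr(target_cls, method_name)
    if getattr(original, "_is_patched", False):
        return original  # already patched, keep the call idempotent

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        return replacement(original, *args, **kwargs)

    wrapper._is_patched = True
    setattr(target_cls, method_name, wrapper)
    return original


def hpu_device_name(original, self) -> str:
    # Hypothetical override: report the accelerator the plugin runs on.
    return "hpu"


apply_patch(ExternalCacheConnector, "device_name", hpu_device_name)
```

Guarding against double application matters in plugin code, because import hooks can run more than once and a re-wrapped method would otherwise nest wrappers.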

Serving & Infrastructure

  • Parameterized EXTRA_INDEX_URL in Dockerfiles. (#1131)
  • Added VLLM_REPO and VLLM_GAUDI_REPO arguments to RHEL UBI Dockerfile. (#1225)
  • Added real context length to the high-level profile. (#1169)
  • Extended the sleep mode model swapping test to cover more than two models. (#1100)
  • Added AI agents config files. (#1123)
  • Updated the quickstart guide and supported model list. (#1173)

Fixes

  • Fixed OOM crashes during high-concurrency inference. (#1124)
  • Fixed multimodal prefill batching for 2D padded inputs. (#1126)
  • Fixed M-RoPE position tensor shape for batched multimodal prefill (BS>1). (#1216)
  • Fixed preempted prompts and prefill/decoding splitting. (#830)
  • Fixed grammar bitmask corruption in mixed structured-output batches. (#1200)
  • Fixed Qwen out-of-host-memory (OOM) errors. (#1247)
  • Fixed a SharedFusedMoE attribute error for Llama4 MoE layers. (#1172)
  • Fixed false-positive cross-layer block detection for MLA in NIXL. (#1205)
  • Fixed block size setting for Granite 4.0h. (#1318)
  • Fixed Granite4.0h fallback bucket padding. (#1207)
  • Fixed wrong AI Lab names in validated_models.md. (#1282)
  • Fixed the -u flag requiring an argument in calibrate_model.sh. (#1121)
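To make the mixed structured-output case from #1200 more concrete, the sketch below shows how a packed token bitmask is typically applied to logits when only some requests in a batch use structured output. It assumes the common layout of 32-bit words with a set bit meaning "token allowed"; the function name and arguments are hypothetical, and the code does not reproduce the actual fix.

```python
import math
from typing import List


def apply_token_bitmask(
    logits: List[List[float]],
    bitmasks: List[List[int]],
    structured_rows: List[int],
) -> List[List[float]]:
    """Mask disallowed tokens for structured-output rows only (illustrative).

    logits:          [batch][vocab] scores
    bitmasks:        one packed row per structured request, 32 tokens per word
    structured_rows: batch index of each bitmask row, in the same order
    Rows not listed in structured_rows are returned unchanged.
    """
    if len(bitmasks) != len(structured_rows):
        raise ValueError("each bitmask row needs exactly one batch index")
    masked = [list(row) for row in logits]  # copy so the input is not mutated
    for mask_row, batch_row in zip(bitmasks, structured_rows):
        for token_id in range(len(masked[batch_row])):
            word, bit = divmod(token_id, 32)
            if not (mask_row[word] >> bit) & 1:
                masked[batch_row][token_id] = -math.inf
    return masked
```

The explicit `structured_rows` mapping is the important part: in a mixed batch the bitmask rows and the batch rows have different counts, so indexing one by the other directly would mask the wrong requests.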

Security

  • Fixed Coverity findings, including security issues, null-like values, duplicates, and typos. (#1164)
  • Fixed SDL secure error handling issues. (#1245)

Deprecation & Breaking Changes

  • Upgraded to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10, which requires users to update their Intel Gaudi software stack from v1.23.0 to v1.24.0.
  • Removed unused Unified Attention (UA) code. (#1226)
  • Removed the aggregate module HpuDeepseekOCRVisual. (#1102)
  • Removed deprecated virtual_engine from ForwardContext. (#1187)

Full Changelog

PR | Title | Author
#1334 | Fix Synapse GC compile failure for FP8-quantized models | @jiminha
#1330 | Mamba prefix caching support for v0.19.0 | @jbyczkow
#1323 | Fix for proper KV cache slot addressing for Hybrid models | @ksmusz
#1320 | Separate conv1d for Granite 4.0 (v0.17.1-style) | @jbyczkow
#1319 | Optimizing visible block number for Hybrid kv_cache | @ksmusz
#1318 | Fix block size setting for granite 4.0h | @jkaniecki
#1315 | Granite-4.0-h Calibration config | @jbyczkow
#1312 | Changes for qwen35 | @shepark
#1295 | Optimize selective_state_update | @jbyczkow
#1287 | Resolving kv_cache access in HPUMambaMixer2 and reintroducing Granite4.0 in CI | @ksmusz
#1286 | Temporarily removing granite-4-h-small from CI | @ksmusz
#1282 | Fix wrong AI Lab names in validated_models.md | @MaxAmende
#1279 | Upstream vLLM compatibility fix | @iboiko-habana
#1262 | ci: fix EOF error when PR title contains apostrophe | @adobrzyn
#1247 | Fix of Qwen Out of HOST memory (OOM) | @iboiko-habana
#1245 | SDL secure error handling fixes | @adobrzyn
#1235 | qwen35 compact mode | @libinta
#1229 | Fix CPUOffloadingSpec import path and remove obsolete roberta patch | @pawel-olejniczak
#1226 | Remove dead Unified Attention (UA) code | @adobrzyn
#1225 | Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile | @aung-san-i
#1222 | Fix INC/MLA alias-path quantization failures | @pawel-olejniczak
#1221 | Granite-4.0-h Calibration config | @mfylcek
#1220 | Fix FP8 block-to-channel conversion breaking MLA weight loading | @afierka-intel
#1216 | Fix M-RoPE position tensor shape for batched multimodal prefill (BS>1) | @afierka-intel
#1214 | Fix param mismatch for compute_nixl_compatibility_hash() | @hsubramony
#1212 | Fix include all sub-packages in setuptools package discovery | @Xaenalt
#1207 | Granite4.0h fallback bucket padding fix | @mfylcek
#1205 | Fix false-positive cross-layer block detection for MLA in NIXL | @iboiko-habana
#1200 | Fix grammar bitmask corruption in mixed structured-output batches | @jbyczkow
#1194 | Port: Fix SharedFusedMoE attribute error for Llama4 MoE layers | @adobrzyn
#1187 | Remove deprecated virtual_engine from ForwardContext | @iboiko-habana
#1183 | Fix INC FP8 dynamic quantization for MoE models on HPU | @yeonsily
#1181 | Disable nixl CI tests | @iboiko-habana
#1176 | Monkey patch for LMCache | @hlin99
#1174 | Upstream vLLM hourly fix | @tzielinski-habana
#1173 | Update quickstart guide and supported model list | @PatrykWo
#1172 | Fix SharedFusedMoE attribute error for Llama4 MoE layers | @adobrzyn
#1169 | Add real context length to the high-level profile | @yangulei
#1165 | Reintroduce ci test for granite-4-h-small | @microslaw
#1164 | Coverity fix including security, null-like values, duplicates and typos | @adobrzyn
#1160 | Cap decode block bucket limit to reduce warmup time | @adobrzyn
#1156 | Load and Dequant MxFP4 Weights | @SKRohit
#1153 | qwen35 initial enablement | @libinta
#1146 | Fix KV cache memory regression from unconditional RowParallelLinear OOT registration | @kamil-kaczor
#1141 | Add num_spec field to MambaMixer2 for upstream compatibility | @jbyczkow
#1140 | Exclude dummy block from NIXL KV cache registration | @yeonsily
#1136 | PR-1054 revert | @jczaja
#1135 | Temporary nixl test cases disablement | @iboiko-habana
#1131 | Parameterize EXTRA_INDEX_URL | @PatrykWo
#1129 | Upstream vLLM compatibility fixes | @iboiko-habana
#1126 | Fix multimodal prefill batching for 2D padded inputs | @afierka-intel
#1124 | Fix OOM crashes during high-concurrency inference | @afierka-intel
#1123 | Add AI agents config files | @kamil-kaczor
#1121 | Fix -u flag requiring argument in calibrate_model.sh | @afierka-intel
#1102 | Remove aggregate module HpuDeepseekOCRVisual | @jwieczorekhabana
#1100 | Add more than 2 models to sleep mode model swapping test | @12010486
#1093 | Set reserved mem for Torch compile | @nngokhale
#1092 | Creating custom depthwise conv1d kernel for MambaMixer2 | @ksmusz
#986 | Adapt Online defragmenter for torch compile | @jwieczorekhabana
#830 | Fix preempted prompts and prefill/decoding splitting | @yangulei

New Contributors

Welcome to the following first-time contributors to vLLM Gaudi Plugin!

  • @aung-san-i — Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile (#1225)
  • @MaxAmende — Fix wrong AI Lab names in validated_models.md (#1282)
  • @Xaenalt — Fix setuptools package discovery to include sub-packages (#1212)
  • @12010486 — Add more than 2 models to sleep mode model swapping test (#1100)
  • @hlin99 — Monkey patch for LMCache (#1176)