vLLM Gaudi Plugin v0.19.0 Release Notes

Overview

This release is based on vLLM v0.19.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.

Highlights

Upgraded platform compatibility to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10.
Introduced Qwen 3.5 model support with compact mode for improved memory utilization.
Introduced Mamba prefix caching support for hybrid SSM-Transformer models on v0.19.0.
Added MxFP4 weight loading and dequantization for next-generation quantization formats.
Integrated LMCache support via monkey-patching for external cache backends.
Introduced custom depthwise conv1d TPC kernel for MambaMixer2 to improve hybrid model performance.
Adapted the online defragmenter for torch.compile mode, enabling memory defragmentation in compiled execution.
Improved warmup performance by capping decode block bucket limits.
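
The last highlight can be illustrated with a small sketch. Warmup has to cover every bucket, so fewer buckets mean less warmup time. The helper below is hypothetical: the function name, the linear bucket spacing, and all numeric values are assumptions for illustration and do not reflect the plugin's actual bucketing code or defaults.

```python
from typing import List, Optional


def decode_block_buckets(min_blocks: int, step: int, max_blocks: int,
                         cap: Optional[int] = None) -> List[int]:
    """Return linearly spaced block-count buckets (hypothetical scheme).

    If `cap` is given, no bucket exceeds it, which shrinks the number of
    shapes that warmup has to cover.
    """
    limit = max_blocks if cap is None else min(max_blocks, cap)
    limit = max(limit, min_blocks)  # always keep at least the smallest bucket
    buckets = list(range(min_blocks, limit + 1, step))
    if buckets[-1] != limit:
        buckets.append(limit)  # make sure the limit itself is covered
    return buckets


# Illustrative numbers only.
uncapped = decode_block_buckets(128, 128, 4096)
capped = decode_block_buckets(128, 128, 4096, cap=1024)
print(len(uncapped), len(capped))  # prints "32 8"
```

The trade-off in a scheme like this is that requests needing more blocks than the cap fall outside the warmed-up range.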

New Model Support

Added initial Qwen 3.5 model support. (#1153)
Added Qwen 3.5 compact mode. (#1235)
Added further Qwen 3.5 changes and fixes. (#1312)

Performance

Created a custom depthwise conv1d kernel for MambaMixer2. (#1092)
Adapted the online defragmenter for torch.compile. (#986)
Set reserved memory for torch.compile. (#1093)
Capped the decode block bucket limit to reduce warmup time. (#1160)
Optimized selective_state_update. (#1295)
Optimized the visible block number for hybrid KV cache. (#1319)
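
For readers unfamiliar with the operation behind the custom MambaMixer2 kernel, the following is a minimal pure-Python reference of a causal depthwise 1D convolution, the kind of short convolution used in Mamba-style mixers. It only illustrates the math: the function name and the nested-list data layout are assumptions, and it says nothing about how the TPC kernel itself is implemented.

```python
from typing import List, Optional, Sequence


def depthwise_causal_conv1d(x: Sequence[Sequence[float]],
                            weight: Sequence[Sequence[float]],
                            bias: Optional[Sequence[float]] = None
                            ) -> List[List[float]]:
    """Reference depthwise causal conv1d.

    x:      [channels][seq_len] input
    weight: [channels][kernel_size], one filter per channel (depthwise)
    bias:   optional [channels]
    Returns [channels][seq_len]; output step t depends only on inputs <= t.
    """
    out = []
    for c, (row, taps) in enumerate(zip(x, weight)):
        k = len(taps)
        # Left-pad with zeros so the window never looks at future positions.
        padded = [0.0] * (k - 1) + list(row)
        b = bias[c] if bias is not None else 0.0
        out.append([
            b + sum(taps[j] * padded[t + j] for j in range(k))
            for t in range(len(row))
        ])
    return out


# Channel 0 uses an identity filter, channel 1 a running sum over 3 steps.
y = depthwise_causal_conv1d([[1, 2, 3, 4], [1, 1, 1, 1]],
                            [[0, 0, 1], [1, 1, 1]])
print(y)  # prints "[[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 3.0]]"
```

Because each channel has its own filter and no cross-channel mixing occurs, the operation is cheap in arithmetic and tends to benefit from a dedicated kernel.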

Attention & KV Cache

Added Mamba prefix caching support for v0.19.0. (#1330)
Fixed KV cache slot addressing for hybrid models. (#1323)
Resolved KV cache access in HPUMambaMixer2 and reintroduced Granite 4.0 in CI. (#1287)
Removed dead Unified Attention (UA) code. (#1226)
Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1146)
Excluded dummy block from NIXL KV cache registration. (#1140)

Quantization

Added loading and dequantization of MxFP4 weights. (#1156)
Fixed INC FP8 dynamic quantization for MoE models on HPU. (#1183)
Fixed FP8 block-to-channel conversion breaking MLA weight loading. (#1220)
Fixed INC/MLA alias-path quantization failures. (#1222)
Added Granite-4.0-h calibration config. (#1221)
Fixed Synapse GC compile failure for FP8-quantized models. (#1334)
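
As background for the MxFP4 item above, here is a minimal sketch of what dequantizing MXFP4 data involves, following the OCP Microscaling format: 4-bit E2M1 elements grouped in blocks of 32 that share one E8M0 power-of-two scale. The function name, the assumption that the low nibble holds the first element of each byte, and the omission of the E8M0 NaN encoding are illustrative choices; the plugin's actual loader may differ.

```python
from typing import List

# The eight non-negative values representable in FP4 E2M1.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Codes 0-7 are positive, codes 8-15 set the sign bit.
FP4_LUT = E2M1_VALUES + [-v for v in E2M1_VALUES]
BLOCK_SIZE = 32   # elements sharing one scale
E8M0_BIAS = 127   # scale = 2 ** (stored_exponent - 127)


def dequantize_mxfp4(packed: bytes, scales: bytes) -> List[float]:
    """Expand packed FP4 codes (two per byte) into floats.

    packed: two 4-bit codes per byte, low nibble first (assumed layout)
    scales: one E8M0 exponent byte per block of 32 elements
    """
    if len(packed) * 2 != len(scales) * BLOCK_SIZE:
        raise ValueError("expected one scale byte per 32 packed elements")
    out = []
    for i, byte in enumerate(packed):
        # Both nibbles of a byte belong to the same block (32 is even).
        scale = 2.0 ** (scales[(2 * i) // BLOCK_SIZE] - E8M0_BIAS)
        out.append(FP4_LUT[byte & 0x0F] * scale)
        out.append(FP4_LUT[byte >> 4] * scale)
    return out


# One block: first byte holds codes 1 (0.5) and 2 (1.0); scale is 2 ** 1.
values = dequantize_mxfp4(bytes([0x21] + [0x00] * 15), bytes([128]))
print(values[:2])  # prints "[1.0, 2.0]"
```

A production path would typically vectorize the lookup and write to a lower-precision dtype rather than Python floats.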

Plugin Core

Added a patch for LMCache. (#1176)
Removed aggregate module HpuDeepseekOCRVisual. (#1102)
Removed deprecated virtual_engine from ForwardContext. (#1187)
Fixed CPUOffloadingSpec import path and removed obsolete roberta patch. (#1229)
Separated conv1d for Granite 4.0 (v0.17.1-style). (#1320)
Added the num_spec field to MambaMixer2 for upstream compatibility. (#1141)
Fixed setuptools package discovery to include all sub-packages. (#1212)

Serving & Infrastructure

Parameterized EXTRA_INDEX_URL in Dockerfiles. (#1131)
Added VLLM_REPO and VLLM_GAUDI_REPO arguments to RHEL UBI Dockerfile. (#1225)
Added real context length to the high-level profile. (#1169)
Extended the sleep mode model swapping test to cover more than two models. (#1100)
Added AI agents config files. (#1123)
Updated the quickstart guide and supported model list. (#1173)

Fixes

Fixed OOM crashes during high-concurrency inference. (#1124)
Fixed multimodal prefill batching for 2D padded inputs. (#1126)
Fixed M-RoPE position tensor shape for batched multimodal prefill (BS>1). (#1216)
Fixed preempted prompts and prefill/decoding splitting. (#830)
Fixed grammar bitmask corruption in mixed structured-output batches. (#1200)
Fixed Qwen host out-of-memory (OOM) errors. (#1247)
Fixed a SharedFusedMoE attribute error for Llama4 MoE layers. (#1172)
Fixed false-positive cross-layer block detection for MLA in NIXL. (#1205)
Fixed the block size setting for Granite 4.0h. (#1318)
Fixed fallback bucket padding for Granite 4.0h. (#1207)
Fixed wrong AI Lab names in validated_models.md. (#1282)
Fixed the -u flag requiring an argument in calibrate_model.sh. (#1121)

Security

Fixed Coverity issues, including security findings, null-like values, duplicates, and typos. (#1164)
Fixed SDL secure error handling issues. (#1245)

Deprecation & Breaking Changes

Upgraded to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10. Users must update their Intel Gaudi software stack from v1.23.0 to v1.24.0.
Removed unused Unified Attention (UA) code. (#1226)
Removed the aggregate module HpuDeepseekOCRVisual. (#1102)
Removed deprecated virtual_engine from ForwardContext. (#1187)

Full Changelog

PR	Title	Author
#1334	Fix Synapse GC compile failure for FP8-quantized models	@jiminha
#1330	Mamba prefix caching support for v0.19.0	@jbyczkow
#1323	Fix for proper KV cache slot addressing for Hybrid models	@ksmusz
#1320	Separate conv1d for Granite 4.0 (v0.17.1-style)	@jbyczkow
#1319	Optimizing visible block number for Hybrid kv_cache	@ksmusz
#1318	Fix block size setting for granite 4.0h	@jkaniecki
#1315	Granite-4.0-h Calibration config	@jbyczkow
#1312	Changes for qwen35	@shepark
#1295	Optimize selective_state_update	@jbyczkow
#1287	Resolving kv_cache access in HPUMambaMixer2 and reintroducing Granite4.0 in CI	@ksmusz
#1286	Temporarily removing granite-4-h-small from CI	@ksmusz
#1282	Fix wrong AI Lab names in validated_models.md	@MaxAmende
#1279	Upstream vLLM compatibility fix	@iboiko-habana
#1262	ci: fix EOF error when PR title contains apostrophe	@adobrzyn
#1247	Fix of Qwen Out of HOST memory (OOM)	@iboiko-habana
#1245	SDL secure error handling fixes	@adobrzyn
#1235	qwen35 compact mode	@libinta
#1229	Fix CPUOffloadingSpec import path and remove obsolete roberta patch	@pawel-olejniczak
#1226	Remove dead Unified Attention (UA) code	@adobrzyn
#1225	Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile	@aung-san-i
#1222	Fix INC/MLA alias-path quantization failures	@pawel-olejniczak
#1221	Granite-4.0-h Calibration config	@mfylcek
#1220	Fix FP8 block-to-channel conversion breaking MLA weight loading	@afierka-intel
#1216	Fix M-RoPE position tensor shape for batched multimodal prefill (BS>1)	@afierka-intel
#1214	Fix param mismatch for compute_nixl_compatibility_hash()	@hsubramony
#1212	Fix include all sub-packages in setuptools package discovery	@Xaenalt
#1207	Granite4.0h fallback bucket padding fix	@mfylcek
#1205	Fix false-positive cross-layer block detection for MLA in NIXL	@iboiko-habana
#1200	Fix grammar bitmask corruption in mixed structured-output batches	@jbyczkow
#1194	Port: Fix SharedFusedMoE attribute error for Llama4 MoE layers	@adobrzyn
#1187	Remove deprecated virtual_engine from ForwardContext	@iboiko-habana
#1183	Fix INC FP8 dynamic quantization for MoE models on HPU	@yeonsily
#1181	Disable nixl CI tests	@iboiko-habana
#1176	Monkey patch for LMCache	@hlin99
#1174	Upstream vLLM hourly fix	@tzielinski-habana
#1173	Update quickstart guide and supported model list	@PatrykWo
#1172	Fix SharedFusedMoE attribute error for Llama4 MoE layers	@adobrzyn
#1169	Add real context length to the high-level profile	@yangulei
#1165	Reintroduce ci test for granite-4-h-small	@microslaw
#1164	Coverity fix including security, null-like values, duplicates and typos	@adobrzyn
#1160	Cap decode block bucket limit to reduce warmup time	@adobrzyn
#1156	Load and Dequant MxFP4 Weights	@SKRohit
#1153	qwen35 initial enablement	@libinta
#1146	Fix KV cache memory regression from unconditional RowParallelLinear OOT registration	@kamil-kaczor
#1141	Add num_spec field to MambaMixer2 for upstream compatibility	@jbyczkow
#1140	Exclude dummy block from NIXL KV cache registration	@yeonsily
#1136	PR-1054 revert	@jczaja
#1135	Temporary nixl test cases disablement	@iboiko-habana
#1131	Parameterize EXTRA_INDEX_URL	@PatrykWo
#1129	Upstream vLLM compatibility fixes	@iboiko-habana
#1126	Fix multimodal prefill batching for 2D padded inputs	@afierka-intel
#1124	Fix OOM crashes during high-concurrency inference	@afierka-intel
#1123	Add AI agents config files	@kamil-kaczor
#1121	Fix -u flag requiring argument in calibrate_model.sh	@afierka-intel
#1102	Remove aggregate module HpuDeepseekOCRVisual	@jwieczorekhabana
#1100	Add more than 2 models to sleep mode model swapping test	@12010486
#1093	Set reserved mem for Torch compile	@nngokhale
#1092	Creating custom depthwise conv1d kernel for MambaMixer2	@ksmusz
#986	Adapt Online defragmenter for torch compile	@jwieczorekhabana
#830	Fix preempted prompts and prefill/decoding splitting	@yangulei

New Contributors

Welcome to the following first-time contributors to the vLLM Gaudi Plugin!

@aung-san-i — Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile (#1225)
@MaxAmende — Fix wrong AI Lab names in validated_models.md (#1282)
@Xaenalt — Fix setuptools package discovery to include sub-packages (#1212)
@12010486 — Add more than 2 models to sleep mode model swapping test (#1100)
@hlin99 — Monkey patch for LMCache (#1176)