vLLM Gaudi Plugin v0.19.0 Release Notes
Overview
This release is based on vLLM v0.19.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.
Highlights
- Upgraded platform compatibility to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10.
- Introduced Qwen 3.5 model support with compact mode for improved memory utilization.
- Introduced Mamba prefix caching support for hybrid SSM-Transformer models on v0.19.0.
- Added MxFP4 weight loading and dequantization for next-generation quantization formats.
- Integrated LMCache support via monkey-patching for external cache backends.
- Introduced custom depthwise conv1d TPC kernel for MambaMixer2 to improve hybrid model performance.
- Adapted the online defragmenter for torch.compile mode, enabling memory defragmentation in compiled execution.
- Improved warmup performance by capping decode block bucket limits.
New Model Support
- Added initial Qwen 3.5 model support. (#1153)
- Added Qwen 3.5 compact mode. (#1235)
- Added further Qwen 3.5 changes and fixes. (#1312)
Performance
- Created a custom depthwise conv1d kernel for MambaMixer2. (#1092)
- Adapted the online defragmenter for torch.compile. (#986)
- Set reserved memory for torch.compile. (#1093)
- Capped the decode block bucket limit to reduce warmup time. (#1160)
- Optimized `selective_state_update`. (#1295)
- Optimized the visible block number for hybrid KV cache. (#1319)
Attention & KV Cache
- Added Mamba prefix caching support for v0.19.0. (#1330)
- Fixed KV cache slot addressing for hybrid models. (#1323)
- Resolved KV cache access in HPUMambaMixer2 and reintroduced Granite4.0 in CI. (#1287)
- Removed dead Unified Attention (UA) code. (#1226)
- Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1146)
- Excluded dummy block from NIXL KV cache registration. (#1140)
Quantization
- Added loading and dequantization of MxFP4 weights. (#1156)
- Fixed INC FP8 dynamic quantization for MoE models on HPU. (#1183)
- Fixed FP8 block-to-channel conversion breaking MLA weight loading. (#1220)
- Fixed INC/MLA alias-path quantization failures. (#1222)
- Added Granite-4.0-h calibration config. (#1221)
- Fixed Synapse GC compile failure for FP8-quantized models. (#1334)
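
For readers unfamiliar with the format, the following is a minimal sketch of what dequantizing one MXFP4 block involves. It follows the OCP Microscaling layout (32 four-bit E2M1 elements sharing one E8M0 power-of-two scale) and is not the plugin's implementation from #1156. The function name `dequantize_mxfp4_block` is hypothetical, the low-nibble-first packing order is an assumption, and the special NaN scale encoding is ignored.

```python
# Illustrative only: a pure-Python sketch of MXFP4 block dequantization.
# Names and packing order are assumptions, not the plugin's API.

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements that share one scale


def dequantize_mxfp4_block(packed: bytes, scale_exponent: int) -> list[float]:
    """Expand 16 packed bytes (two 4-bit codes each) into 32 floats.

    scale_exponent is the E8M0 byte; the scale is 2 ** (scale_exponent - 127).
    """
    if len(packed) != BLOCK_SIZE // 2:
        raise ValueError("expected 16 bytes per MXFP4 block")
    scale = 2.0 ** (scale_exponent - 127)
    values = []
    for byte in packed:
        # Assumed order: low nibble first, then high nibble.
        for code in (byte & 0x0F, byte >> 4):
            magnitude = E2M1_VALUES[code & 0x7]
            # Bit 3 of the code is the sign bit.
            values.append(-magnitude * scale if code & 0x8 else magnitude * scale)
    return values


# Each byte 0x21 holds codes 1 (0.5) and 2 (1.0); exponent 128 gives scale 2.
print(dequantize_mxfp4_block(bytes([0x21] * 16), 128)[:4])  # [1.0, 2.0, 1.0, 2.0]
```

A real loader would do this on tensors rather than Python lists and would follow the checkpoint's actual packing layout.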
Plugin Core
- Added a patch for LMCache. (#1176)
- Removed aggregate module HpuDeepseekOCRVisual. (#1102)
- Removed deprecated `virtual_engine` from `ForwardContext`. (#1187)
- Fixed CPUOffloadingSpec import path and removed obsolete roberta patch. (#1229)
- Separated conv1d for Granite 4.0 (v0.17.1-style). (#1320)
- Added the `num_spec` field to MambaMixer2 for upstream compatibility. (#1141)
- Fixed setuptools package discovery to include all sub-packages. (#1212)
Serving & Infrastructure
- Parameterized EXTRA_INDEX_URL in Dockerfiles. (#1131)
- Added VLLM_REPO and VLLM_GAUDI_REPO arguments to RHEL UBI Dockerfile. (#1225)
- Added real context length to the high-level profile. (#1169)
- Extended the sleep mode model swapping test to cover more than two models. (#1100)
- Added AI agents config files. (#1123)
- Updated the quickstart guide and supported model list. (#1173)
Fixes
- Fixed OOM crashes during high-concurrency inference. (#1124)
- Fixed multimodal prefill batching for 2D padded inputs. (#1126)
- Fixed M-RoPE position tensor shape for batched multimodal prefill (BS>1). (#1216)
- Fixed preempted prompts and prefill/decoding splitting. (#830)
- Fixed grammar bitmask corruption in mixed structured-output batches. (#1200)
- Fixed host out-of-memory (OOM) errors for Qwen models. (#1247)
- Fixed a `SharedFusedMoE` attribute error for Llama4 MoE layers. (#1172)
- Fixed false-positive cross-layer block detection for MLA in NIXL. (#1205)
- Fixed block size setting for Granite 4.0h. (#1318)
- Fixed fallback bucket padding for Granite 4.0h. (#1207)
- Fixed wrong AI Lab names in validated_models.md. (#1282)
- Fixed the `-u` flag requiring an argument in calibrate_model.sh. (#1121)
Security
- Fixed Coverity issues, including security findings, null-like values, duplicates, and typos. (#1164)
- Fixed SDL secure error handling issues. (#1245)
Deprecation & Breaking Changes
- Upgraded to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10, which requires users to update their Intel Gaudi software stack from v1.23.0 to v1.24.0.
- Removed unused Unified Attention (UA) code. (#1226)
- Removed the aggregate module `HpuDeepseekOCRVisual`. (#1102)
- Removed deprecated `virtual_engine` from `ForwardContext`. (#1187)
Full Changelog
| PR | Title | Author |
|---|---|---|
| #1334 | Fix Synapse GC compile failure for FP8-quantized models | @jiminha |
| #1330 | Mamba prefix caching support for v0.19.0 | @jbyczkow |
| #1323 | Fix for proper KV cache slot addressing for Hybrid models | @ksmusz |
| #1320 | Separate conv1d for Granite 4.0 (v0.17.1-style) | @jbyczkow |
| #1319 | Optimizing visible block number for Hybrid kv_cache | @ksmusz |
| #1318 | Fix block size setting for granite 4.0h | @jkaniecki |
| #1315 | Granite-4.0-h Calibration config | @jbyczkow |
| #1312 | Changes for qwen35 | @shepark |
| #1295 | Optimize selective_state_update | @jbyczkow |
| #1287 | Resolving kv_cache access in HPUMambaMixer2 and reintroducing Granite4.0 in CI | @ksmusz |
| #1286 | Temporarily removing granite-4-h-small from CI | @ksmusz |
| #1282 | Fix wrong AI Lab names in validated_models.md | @MaxAmende |
| #1279 | Upstream vLLM compatibility fix | @iboiko-habana |
| #1262 | ci: fix EOF error when PR title contains apostrophe | @adobrzyn |
| #1247 | Fix of Qwen Out of HOST memory (OOM) | @iboiko-habana |
| #1245 | SDL secure error handling fixes | @adobrzyn |
| #1235 | qwen35 compact mode | @libinta |
| #1229 | Fix CPUOffloadingSpec import path and remove obsolete roberta patch | @pawel-olejniczak |
| #1226 | Remove dead Unified Attention (UA) code | @adobrzyn |
| #1225 | Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile | @aung-san-i |
| #1222 | Fix INC/MLA alias-path quantization failures | @pawel-olejniczak |
| #1221 | Granite-4.0-h Calibration config | @mfylcek |
| #1220 | Fix FP8 block-to-channel conversion breaking MLA weight loading | @afierka-intel |
| #1216 | Fix M-RoPE position tensor shape for batched multimodal prefill (BS>1) | @afierka-intel |
| #1214 | Fix param mismatch for compute_nixl_compatibility_hash() | @hsubramony |
| #1212 | Fix include all sub-packages in setuptools package discovery | @Xaenalt |
| #1207 | Granite4.0h fallback bucket padding fix | @mfylcek |
| #1205 | Fix false-positive cross-layer block detection for MLA in NIXL | @iboiko-habana |
| #1200 | Fix grammar bitmask corruption in mixed structured-output batches | @jbyczkow |
| #1194 | Port: Fix SharedFusedMoE attribute error for Llama4 MoE layers | @adobrzyn |
| #1187 | Remove deprecated virtual_engine from ForwardContext | @iboiko-habana |
| #1183 | Fix INC FP8 dynamic quantization for MoE models on HPU | @yeonsily |
| #1181 | Disable nixl CI tests | @iboiko-habana |
| #1176 | Monkey patch for LMCache | @hlin99 |
| #1174 | Upstream vLLM hourly fix | @tzielinski-habana |
| #1173 | Update quickstart guide and supported model list | @PatrykWo |
| #1172 | Fix SharedFusedMoE attribute error for Llama4 MoE layers | @adobrzyn |
| #1169 | Add real context length to the high-level profile | @yangulei |
| #1165 | Reintroduce ci test for granite-4-h-small | @microslaw |
| #1164 | Coverity fix including security, null-like values, duplicates and typos | @adobrzyn |
| #1160 | Cap decode block bucket limit to reduce warmup time | @adobrzyn |
| #1156 | Load and Dequant MxFP4 Weights | @SKRohit |
| #1153 | qwen35 initial enablement | @libinta |
| #1146 | Fix KV cache memory regression from unconditional RowParallelLinear OOT registration | @kamil-kaczor |
| #1141 | Add num_spec field to MambaMixer2 for upstream compatibility | @jbyczkow |
| #1140 | Exclude dummy block from NIXL KV cache registration | @yeonsily |
| #1136 | PR-1054 revert | @jczaja |
| #1135 | Temporary nixl test cases disablement | @iboiko-habana |
| #1131 | Parameterize EXTRA_INDEX_URL | @PatrykWo |
| #1129 | Upstream vLLM compatibility fixes | @iboiko-habana |
| #1126 | Fix multimodal prefill batching for 2D padded inputs | @afierka-intel |
| #1124 | Fix OOM crashes during high-concurrency inference | @afierka-intel |
| #1123 | Add AI agents config files | @kamil-kaczor |
| #1121 | Fix -u flag requiring argument in calibrate_model.sh | @afierka-intel |
| #1102 | Remove aggregate module HpuDeepseekOCRVisual | @jwieczorekhabana |
| #1100 | Add more than 2 models to sleep mode model swapping test | @12010486 |
| #1093 | Set reserved mem for Torch compile | @nngokhale |
| #1092 | Creating custom depthwise conv1d kernel for MambaMixer2 | @ksmusz |
| #986 | Adapt Online defragmenter for torch compile | @jwieczorekhabana |
| #830 | Fix preempted prompts and prefill/decoding splitting | @yangulei |
New Contributors
Welcome to the following first-time contributors to vLLM Gaudi Plugin!
- @aung-san-i — Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile (#1225)
- @MaxAmende — Fix wrong AI Lab names in validated_models.md (#1282)
- @Xaenalt — Fix setuptools package discovery to include sub-packages (#1212)
- @12010486 — Add more than 2 models to sleep mode model swapping test (#1100)
- @hlin99 — Monkey patch for LMCache (#1176)