vLLM Gaudi Plugin v0.19.0 Release Notes
Overview
This release is based on vLLM v0.19.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.
Highlights
- Upgraded platform compatibility to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10.
- Introduced Qwen 3.5 model support with compact mode for improved memory utilization.
- Introduced Mamba prefix caching support for hybrid SSM-Transformer models on v0.19.0.
- Added MxFP4 weight loading and dequantization for next-generation quantization formats.
- Added a BF16 fallback path for compressed-tensors INT8 (W8A8) weights on HPU (`HPUCompressedTensorsW8A8Int8_BF16Fallback`).
- Integrated LMCache support via monkey-patching for external cache backends.
- Introduced custom depthwise conv1d TPC kernel for MambaMixer2 to improve hybrid model performance.
- Adapted the online defragmenter for torch.compile mode, enabling memory defragmentation in compiled execution and automatically enabling it when contiguous PA is on.
- Added single-process model swap support, exposing the OpenAI-compatible `/v1/models/switch` endpoint for in-process model switching without server restarts.
- Stabilized long-context decode by bounding decode `block_list` growth and refining bucketing for non-power-of-two `block_size` (e.g. Granite hybrid models), significantly reducing recompilations and improving TPOT.
- Switched the default `PT_HPU_LAZY_MODE` in shipped Docker images to `0` (torch.compile) for both PyTorch upstream and fork builds.
New Model Support
- Added initial Qwen 3.5 model support. (#1153)
- Added Qwen 3.5 compact mode. (#1235)
- Added Qwen 3.5 additional changes and fixes. (#1312)
Performance
- Created a custom depthwise conv1d kernel for MambaMixer2. (#1092)
- Adapted the online defragmenter for torch.compile. (#986)
- Enabled the online defragmenter automatically when contiguous PA is enabled. (#1402)
- Set reserved memory for torch.compile. (#1093)
- Optimized `selective_state_update`. (#1295)
- Optimized the visible block number for hybrid KV cache. (#1319)
- Stabilized decode `block_list` growth for long-context workloads to avoid HPU graph recompilation storms and OOM (significant TPOT/throughput gains on 200K-context runs). (#1376)
- Refined prompt bucket filtering, fallback bucket capping, and `mamba_decode_corrector` for non-power-of-two `block_size` long-context cases (e.g. Granite hybrid models). (#1389)
- Moved multimodal graph warmup under the `PT_COMPILE_ONLY_MODE` context (torch.compile path only) and reused the processor across resolution buckets to reduce warmup time. (#1368)
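As a rough illustration of why bucketing limits recompilation in the long-context items above, the sketch below pads a block count up to a bucket boundary so that many context lengths map to a small set of shapes. It is a generic example, not the plugin's bucketing logic: the function names, the `block_size` of 96, and the bucket step of 128 are all hypothetical values chosen for illustration.

```python
def blocks_needed(context_len: int, block_size: int) -> int:
    """Number of KV cache blocks required to hold context_len tokens."""
    return -(-context_len // block_size)  # ceiling division on integers


def pad_to_bucket(num_blocks: int, step: int) -> int:
    """Round a block count up to the next multiple of the bucket step."""
    return -(-num_blocks // step) * step


# Hypothetical values: a non-power-of-two block size and a bucket step of 128.
BLOCK_SIZE = 96
BUCKET_STEP = 128

# Many different context lengths collapse onto a few padded shapes.
shapes = {
    pad_to_bucket(blocks_needed(n, BLOCK_SIZE), BUCKET_STEP)
    for n in range(1, 200_001, 1000)
}
print(sorted(shapes))
```

Because each distinct padded shape can trigger a separate graph compilation, keeping the set of shapes small and bounded is what keeps warmup and recompilation cost under control.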
Attention & KV Cache
- Added Mamba prefix caching support for v0.19.0. (#1330)
- Fixed KV cache slot addressing for hybrid models. (#1323)
- Resolved KV cache access in HPUMambaMixer2 and reintroduced Granite4.0 in CI. (#1287)
- Removed dead Unified Attention (UA) code. (#1226)
- Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1146)
- Excluded dummy block from NIXL KV cache registration. (#1140)
Quantization
- Added loading and dequantization of MxFP4 weights. (#1156)
- Added `HPUCompressedTensorsW8A8Int8_BF16Fallback` implementation for INT8 weights. (#1394)
- Fixed INC FP8 dynamic quantization for MoE models on HPU. (#1183)
- Fixed FP8 block-to-channel conversion breaking MLA weight loading. (#1220)
- Fixed INC/MLA alias-path quantization failures. (#1222)
- Added Granite-4.0-h calibration config. (#1221)
- Fixed Synapse GC compile failure for FP8-quantized models. (#1334)
- Renamed FP8 blockwise compressed-tensors scales to match HPU ops, fixing a regression in Mistral-Large-3-675B (cherry-pick of #1304). (#1374)
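For readers unfamiliar with MxFP4, the sketch below shows what dequantizing one block involves under the OCP Microscaling format: each element is a 4-bit E2M1 float, and a block of elements shares one E8M0 power-of-two scale. This is a reference-style illustration in plain Python, not the plugin's loader. The function names are hypothetical, and the nibble order within each byte (low nibble first) is an assumption that depends on the checkpoint layout.

```python
# Magnitudes of the eight non-negative FP4 E2M1 codes (sign bit is bit 3).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode a single 4-bit E2M1 code into a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def dequantize_mxfp4_block(packed: bytes, scale_e8m0: int) -> list[float]:
    """Dequantize one block: shared power-of-two scale times each element.

    `packed` holds two FP4 codes per byte; `scale_e8m0` is the shared
    8-bit exponent with a bias of 127.
    """
    if scale_e8m0 == 0xFF:
        raise ValueError("E8M0 scale 0xFF encodes NaN")
    scale = 2.0 ** (scale_e8m0 - 127)
    values = []
    for byte in packed:
        # Assumption: the low nibble stores the first element of each pair.
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


# One byte holding codes 0x1 (0.5) and 0x2 (1.0), with a scale of 2**1.
print(dequantize_mxfp4_block(bytes([0x21]), 128))
```

In the MX specification a block covers 32 elements, so a full block would be 16 packed bytes plus one scale byte. A production path would do the same arithmetic on tensors rather than in a Python loop.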
Plugin Core
- Added a patch for LMCache. (#1176)
- Added single-process model swap support exposing the OpenAI-compatible `/v1/models/switch` endpoint, allowing sequential serving of multiple small models within a single API server process without restarts (cherry-pick of #1258). (#1367)
- Updated the HPU hetero NIXL connector class to track the new vLLM 0.19.0 NIXL interface. (#1373)
- Removed aggregate module HpuDeepseekOCRVisual. (#1102)
- Removed deprecated `virtual_engine` from `ForwardContext`. (#1187)
- Fixed CPUOffloadingSpec import path and removed obsolete roberta patch. (#1229)
- Separated conv1d for Granite 4.0 (v0.17.1-style). (#1320)
- Added the `num_spec` field to MambaMixer2 for upstream compatibility. (#1141)
- Fixed setuptools package discovery to include all sub-packages. (#1212)
Serving & Infrastructure
- Parameterized EXTRA_INDEX_URL in Dockerfiles. (#1131)
- Added VLLM_REPO and VLLM_GAUDI_REPO arguments to RHEL UBI Dockerfile. (#1225)
- Set `PT_HPU_LAZY_MODE=0` (torch.compile) as the default in shipped Docker images for both PyTorch upstream and fork builds, with accompanying documentation updates. (#1397)
- Added real context length to the high-level profile. (#1169)
- Extended the sleep mode model swapping test to cover more than two models. (#1100)
- Added AI agents config files. (#1123)
- Updated the quickstart guide and supported model list. (#1173)
Fixes
- Fixed OOM crashes during high-concurrency inference. (#1124)
- Fixed multimodal prefill batching for 2D padded inputs. (#1126)
- Fixed M-RoPE position tensor shape for batched multimodal prefill (BS>1). (#1216)
- Fixed preempted prompts and prefill/decoding splitting. (#830)
- Fixed grammar bitmask corruption in mixed structured-output batches. (#1200)
- Fixed out-of-host-memory (OOM) errors for Qwen models. (#1247)
- Fixed a `SharedFusedMoE` attribute error for Llama4 MoE layers. (#1172)
- Fixed false-positive cross-layer block detection for MLA in NIXL. (#1205)
- Fixed block size setting for Granite 4.0h. (#1318)
- Fixed Granite4.0h fallback bucket padding. (#1207)
- Fixed wrong AI Lab names in validated_models.md. (#1282)
- Fixed the `-u` flag requiring an argument in calibrate_model.sh. (#1121)
- Flattened 3D `inputs_embeds` in `HpuModelAdapter.forward` to fix a shape mismatch on upstream VL models (e.g. Qwen3-VL-MoE deepstack) when using 2D padded prefill batches. (#1380)
Security
- Fixed Coverity issues, including security, null-like values, duplicates, and typos. (#1164)
- Fixed SDL secure error handling issues. (#1245)
Deprecation & Breaking Changes
- Upgraded to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10, which requires users to update their Intel Gaudi software stack from v1.23.0 to v1.24.0.
- Changed the default `PT_HPU_LAZY_MODE` in shipped Docker images to `0` (torch.compile) for both PyTorch upstream and fork builds. Lazy mode users must opt back in explicitly. (#1397)
- Reverted the v0.19.0 backport of "Cap decode block bucket limit to reduce warmup time" (#1160) on the `releases/v0.19.0` branch; long-context decode is now stabilized via #1376 and #1389 instead. (#1388)
- Removed unused Unified Attention (UA) code. (#1226)
- Removed the aggregate module `HpuDeepseekOCRVisual`. (#1102)
- Removed deprecated `virtual_engine` from `ForwardContext`. (#1187)
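Since the Docker default is now `PT_HPU_LAZY_MODE=0`, a deployment that still relies on lazy mode has to set the variable itself. The sketch below shows one way to build an explicit environment for a launched process. The helper name `hpu_env` is hypothetical, and it assumes that `PT_HPU_LAZY_MODE=1` selects lazy mode and that the variable is set before the Gaudi PyTorch bridge is loaded.

```python
import os
from typing import Optional


def hpu_env(lazy: bool, base: Optional[dict] = None) -> dict:
    """Return a copy of the environment with PT_HPU_LAZY_MODE set explicitly.

    Assumption: "1" selects lazy mode and "0" selects the torch.compile path.
    """
    env = dict(os.environ if base is None else base)
    env["PT_HPU_LAZY_MODE"] = "1" if lazy else "0"
    return env


# Build an environment that opts back in to lazy mode; pass it as
# subprocess.run(..., env=env) when launching the server process.
env = hpu_env(lazy=True, base={})
print(env["PT_HPU_LAZY_MODE"])
```

Setting the variable explicitly in launch scripts, rather than relying on the image default, keeps behavior stable across image upgrades.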
Full Changelog
| PR | Title | Author |
|---|---|---|
| #1402 | To enable defrag if contig_pa is enabled | @iboiko-habana |
| #1397 | Cherry-pick to v0.19.0 Set Docker auto calc PT_HPU_LAZY_MODE=0 as default | @nngokhale |
| #1394 | HPUCompressedTensorsW8A8Int8_BF16Fallback impl | @jbyczkow |
| #1389 | Bucketing edge cases finetune for longer ctx (#1362) | @ksmusz |
| #1388 | [v0.19.0] Revert "Cap decode block bucket limit to reduce warmup time (#1160)" | @adobrzyn |
| #1380 | Flatten 3D inputs_embeds in HpuModelAdapter.forward | @shepark |
| #1376 | Stabilize decode block_list growth for long-context workloads (v0.19.0) | @adobrzyn |
| #1374 | Cherry-pick: Updated fix regression in Mistral-Large-3-675B (#1304) for v0.19.0 | @skavulya |
| #1373 | v0.19.0 interface fixes for hetero nixl connector | @sandeep-maddipatla |
| #1368 | Move mm graph warmup under pt compile only context | @shepark |
| #1367 | Cherry-pick from PR#1258 (single-process model swap) | @12010486 |
| #1334 | Fix Synapse GC compile failure for FP8-quantized models | @jiminha |
| #1330 | Mamba prefix caching support for v0.19.0 | @jbyczkow |
| #1323 | Fix for proper KV cache slot addressing for Hybrid models | @ksmusz |
| #1320 | Separate conv1d for Granite 4.0 (v0.17.1-style) | @jbyczkow |
| #1319 | Optimizing visible block number for Hybrid kv_cache | @ksmusz |
| #1318 | Fix block size setting for granite 4.0h | @jkaniecki |
| #1315 | Granite-4.0-h Calibration config | @jbyczkow |
| #1312 | Changes for qwen35 | @shepark |
| #1295 | Optimize selective_state_update | @jbyczkow |
| #1287 | Resolving kv_cache access in HPUMambaMixer2 and reintroducing Granite4.0 in CI | @ksmusz |
| #1286 | Temporarily removing granite-4-h-small from CI | @ksmusz |
| #1282 | Fix wrong AI Lab names in validated_models.md | @MaxAmende |
| #1279 | Upstream vLLM compatibility fix | @iboiko-habana |
| #1262 | ci: fix EOF error when PR title contains apostrophe | @adobrzyn |
| #1247 | Fix of Qwen Out of HOST memory (OOM) | @iboiko-habana |
| #1245 | SDL secure error handling fixes | @adobrzyn |
| #1235 | qwen35 compact mode | @libinta |
| #1229 | Fix CPUOffloadingSpec import path and remove obsolete roberta patch | @pawel-olejniczak |
| #1226 | Remove dead Unified Attention (UA) code | @adobrzyn |
| #1225 | Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile | @aung-san-i |
| #1222 | Fix INC/MLA alias-path quantization failures | @pawel-olejniczak |
| #1221 | Granite-4.0-h Calibration config | @mfylcek |
| #1220 | Fix FP8 block-to-channel conversion breaking MLA weight loading | @afierka-intel |
| #1216 | Fix M-RoPE position tensor shape for batched multimodal prefill (BS>1) | @afierka-intel |
| #1214 | Fix param mismatch for compute_nixl_compatibility_hash() | @hsubramony |
| #1212 | Fix include all sub-packages in setuptools package discovery | @Xaenalt |
| #1207 | Granite4.0h fallback bucket padding fix | @mfylcek |
| #1205 | Fix false-positive cross-layer block detection for MLA in NIXL | @iboiko-habana |
| #1200 | Fix grammar bitmask corruption in mixed structured-output batches | @jbyczkow |
| #1194 | Port: Fix SharedFusedMoE attribute error for Llama4 MoE layers | @adobrzyn |
| #1187 | Remove deprecated virtual_engine from ForwardContext | @iboiko-habana |
| #1183 | Fix INC FP8 dynamic quantization for MoE models on HPU | @yeonsily |
| #1181 | Disable nixl CI tests | @iboiko-habana |
| #1176 | Monkey patch for LMCache | @hlin99 |
| #1174 | Upstream vLLM hourly fix | @tzielinski-habana |
| #1173 | Update quickstart guide and supported model list | @PatrykWo |
| #1172 | Fix SharedFusedMoE attribute error for Llama4 MoE layers | @adobrzyn |
| #1169 | Add real context length to the high-level profile | @yangulei |
| #1165 | Reintroduce ci test for granite-4-h-small | @microslaw |
| #1164 | Coverity fix including security, null-like values, duplicates and typos | @adobrzyn |
| #1160 | Cap decode block bucket limit to reduce warmup time | @adobrzyn |
| #1156 | Load and Dequant MxFP4 Weights | @SKRohit |
| #1153 | qwen35 initial enablement | @libinta |
| #1146 | Fix KV cache memory regression from unconditional RowParallelLinear OOT registration | @kamil-kaczor |
| #1141 | Add num_spec field to MambaMixer2 for upstream compatibility | @jbyczkow |
| #1140 | Exclude dummy block from NIXL KV cache registration | @yeonsily |
| #1136 | PR-1054 revert | @jczaja |
| #1135 | Temporary nixl test cases disablement | @iboiko-habana |
| #1131 | Parameterize EXTRA_INDEX_URL | @PatrykWo |
| #1129 | Upstream vLLM compatibility fixes | @iboiko-habana |
| #1126 | Fix multimodal prefill batching for 2D padded inputs | @afierka-intel |
| #1124 | Fix OOM crashes during high-concurrency inference | @afierka-intel |
| #1123 | Add AI agents config files | @kamil-kaczor |
| #1121 | Fix -u flag requiring argument in calibrate_model.sh | @afierka-intel |
| #1102 | Remove aggregate module HpuDeepseekOCRVisual | @jwieczorekhabana |
| #1100 | Add more than 2 models to sleep mode model swapping test | @12010486 |
| #1093 | Set reserved mem for Torch compile | @nngokhale |
| #1092 | Creating custom depthwise conv1d kernel for MambaMixer2 | @ksmusz |
| #986 | Adapt Online defragmenter for torch compile | @jwieczorekhabana |
| #830 | Fix preempted prompts and prefill/decoding splitting | @yangulei |
New Contributors
Welcome to the following first-time contributors to vLLM Gaudi Plugin!
- @aung-san-i — Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile (#1225)
- @MaxAmende — Fix wrong AI Lab names in validated_models.md (#1282)
- @Xaenalt — Fix setuptools package discovery to include sub-packages (#1212)
- @12010486 — Add more than 2 models to sleep mode model swapping test (#1100)
- @hlin99 — Monkey patch for LMCache (#1176)
- @sandeep-maddipatla — v0.19.0 interface fixes for hetero nixl connector (#1373)