vLLM Gaudi Plugin v0.19.0 Release Notes

Overview

This release is based on vLLM v0.19.0 and supports Intel® Gaudi® Software v1.24.0 with PyTorch 2.10.


Highlights

  • Upgraded platform compatibility to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10.
  • Introduced Qwen 3.5 model support with compact mode for improved memory utilization.
  • Introduced Mamba prefix caching support for hybrid SSM-Transformer models on v0.19.0.
  • Added MxFP4 weight loading and dequantization for next-generation quantization formats.
  • Added a BF16 fallback path for compressed-tensors INT8 (W8A8) weights on HPU (HPUCompressedTensorsW8A8Int8_BF16Fallback).
  • Integrated LMCache support via monkey-patching for external cache backends.
  • Introduced custom depthwise conv1d TPC kernel for MambaMixer2 to improve hybrid model performance.
  • Adapted the online defragmenter for torch.compile mode so that memory defragmentation works in compiled execution; it is now enabled automatically when contiguous PA is on.
  • Added single-process model swap support, exposing the OpenAI-compatible /v1/models/switch endpoint for in-process model switching without server restarts.
  • Stabilized long-context decode by bounding decode block_list growth and refining bucketing for non-power-of-two block_size (e.g. Granite hybrid models), significantly reducing recompilations and improving TPOT.
  • Switched the default PT_HPU_LAZY_MODE in shipped Docker images to 0 (torch.compile) for both PyTorch upstream and fork builds.

New Model Support

  • Added initial Qwen 3.5 model support. (#1153)
  • Added Qwen 3.5 compact mode. (#1235)
  • Added Qwen 3.5 additional changes and fixes. (#1312)

Performance

  • Created a custom depthwise conv1d kernel for MambaMixer2. (#1092)
  • Adapted the online defragmenter for torch.compile. (#986)
  • Enabled the online defragmenter automatically when contiguous PA is enabled. (#1402)
  • Set reserved memory for torch.compile. (#1093)
  • Optimized selective_state_update. (#1295)
  • Optimized the visible block number for hybrid KV cache. (#1319)
  • Stabilized decode block_list growth for long-context workloads to avoid HPU graph recompilation storms and OOM (significant TPOT/throughput gains on 200K-context runs). (#1376)
  • Refined prompt bucket filtering, fallback bucket capping, and mamba_decode_corrector for non-power-of-two block_size long-context cases (e.g. Granite hybrid models). (#1389)
  • Moved multimodal graph warmup under the PT_COMPILE_ONLY_MODE context (torch.compile path only) and reused the processor across resolution buckets to reduce warmup time. (#1368)
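
The bucketing items above rely on a general idea that can be sketched in a few lines: rounding the number of KV cache blocks up to a coarse bucket keeps the set of distinct tensor shapes small, which in turn limits graph recompilations. The sketch below is only an illustration of that idea, not the plugin's bucketing code; the block size of 48, the bucket step of 128, and the function names are all hypothetical.

```python
def blocks_needed(context_len: int, block_size: int) -> int:
    """Number of KV cache blocks required to hold context_len tokens."""
    return -(-context_len // block_size)  # integer ceiling division


def bucket_up(num_blocks: int, step: int) -> int:
    """Round a block count up to the next multiple of step."""
    return -(-num_blocks // step) * step


BLOCK_SIZE = 48        # hypothetical non-power-of-two block size
STEP = 128             # hypothetical bucket granularity, in blocks
MAX_CONTEXT = 200_000  # long-context run, in tokens

# Every distinct block count would be a distinct shape without bucketing.
exact_shapes = {blocks_needed(n, BLOCK_SIZE) for n in range(1, MAX_CONTEXT + 1)}

# With bucketing, many block counts collapse onto the same padded shape.
bucketed_shapes = {bucket_up(b, STEP) for b in exact_shapes}

print(len(exact_shapes), "shapes without bucketing")
print(len(bucketed_shapes), "shapes with bucketing")
```

The trade-off is padding: each decode step may process up to one bucket step of unused blocks in exchange for far fewer compiled shapes.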

Attention & KV Cache

  • Added Mamba prefix caching support for v0.19.0. (#1330)
  • Fixed KV cache slot addressing for hybrid models. (#1323)
  • Resolved KV cache access in HPUMambaMixer2 and reintroduced Granite 4.0 in CI. (#1287)
  • Removed dead Unified Attention (UA) code. (#1226)
  • Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1146)
  • Excluded dummy block from NIXL KV cache registration. (#1140)

Quantization

  • Added loading and dequantization of MxFP4 weights. (#1156)
  • Added HPUCompressedTensorsW8A8Int8_BF16Fallback implementation for INT8 weights. (#1394)
  • Fixed INC FP8 dynamic quantization for MoE models on HPU. (#1183)
  • Fixed FP8 block-to-channel conversion breaking MLA weight loading. (#1220)
  • Fixed INC/MLA alias-path quantization failures. (#1222)
  • Added Granite-4.0-h calibration config. (#1221)
  • Fixed Synapse GC compile failure for FP8-quantized models. (#1334)
  • Renamed FP8 blockwise compressed-tensors scales to match HPU ops, fixing a regression in Mistral-Large-3-675B (cherry-pick of #1304). (#1374)
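
For readers unfamiliar with MxFP4, the following minimal sketch shows what dequantization of one block involves under the OCP Microscaling (MX) format: each element is a 4-bit E2M1 value, and a block of elements shares one 8-bit power-of-two scale (E8M0, bias 127). This is not the plugin's implementation. The function names are hypothetical, and the nibble order (low nibble first) is an assumption that may differ from how a given checkpoint packs its weights.

```python
from typing import List

# Magnitudes representable by the 3 low bits of an E2M1 code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_block(packed: bytes, scale_byte: int) -> List[float]:
    """Dequantize packed FP4 codes that share one E8M0 scale.

    Assumption: each byte holds two codes, low nibble first.
    The value 0xFF is reserved for NaN in E8M0 and is rejected here.
    """
    if scale_byte == 0xFF:
        raise ValueError("E8M0 scale 0xFF encodes NaN")
    scale = 2.0 ** (scale_byte - 127)
    values: List[float] = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


# Two codes (0x1 -> 0.5, 0x2 -> 1.0) with scale 2**(128 - 127) = 2.
print(dequantize_block(bytes([0x21]), 128))
```

A real kernel would do the same table lookup and scaling on whole tensors and cast the result to BF16 rather than looping in Python.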

Plugin Core

  • Added a patch for LMCache. (#1176)
  • Added single-process model swap support exposing the OpenAI-compatible /v1/models/switch endpoint, allowing sequential serving of multiple small models within a single API server process without restarts (cherry-pick of #1258). (#1367)
  • Updated the HPU hetero NIXL connector class to track the new vLLM 0.19.0 NIXL interface. (#1373)
  • Removed aggregate module HpuDeepseekOCRVisual. (#1102)
  • Removed deprecated virtual_engine from ForwardContext. (#1187)
  • Fixed CPUOffloadingSpec import path and removed obsolete roberta patch. (#1229)
  • Separated conv1d for Granite 4.0 (v0.17.1-style). (#1320)
  • Added the num_spec field to MambaMixer2 for upstream compatibility. (#1141)
  • Fixed setuptools package discovery to include all sub-packages. (#1212)
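
As a rough illustration of how a client might call the /v1/models/switch endpoint mentioned above, the sketch below builds (but does not send) an HTTP request using only the Python standard library. Only the endpoint path comes from these notes. The POST method, the JSON body with a "model" field, the server address, and the model name are assumptions made for illustration; check the plugin documentation for the actual request schema.

```python
import json
import urllib.request


def build_switch_request(base_url: str, model: str) -> urllib.request.Request:
    """Build a request for the model switch endpoint.

    Assumptions (not confirmed by the release notes): the endpoint
    accepts POST with a JSON body of the form {"model": "<name>"}.
    """
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/models/switch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and model name.
request = build_switch_request("http://localhost:8000", "example-org/small-model-b")
print(request.get_method(), request.full_url)

# To actually send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(response.status, response.read().decode("utf-8"))
```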

Serving & Infrastructure

  • Parameterized EXTRA_INDEX_URL in Dockerfiles. (#1131)
  • Added VLLM_REPO and VLLM_GAUDI_REPO arguments to RHEL UBI Dockerfile. (#1225)
  • Set PT_HPU_LAZY_MODE=0 (torch.compile) as the default in shipped Docker images for both PyTorch upstream and fork builds, with accompanying documentation updates. (#1397)
  • Added real context length to the high-level profile. (#1169)
  • Extended the sleep mode model swapping test to cover more than two models. (#1100)
  • Added AI agents config files. (#1123)
  • Updated the quickstart guide and supported model list. (#1173)

Fixes

  • Fixed OOM crashes during high-concurrency inference. (#1124)
  • Fixed multimodal prefill batching for 2D padded inputs. (#1126)
  • Fixed M-RoPE position tensor shape for batched multimodal prefill (BS>1). (#1216)
  • Fixed preempted prompts and prefill/decoding splitting. (#830)
  • Fixed grammar bitmask corruption in mixed structured-output batches. (#1200)
  • Fixed out-of-host-memory (OOM) errors with Qwen models. (#1247)
  • Fixed a SharedFusedMoE attribute error for Llama4 MoE layers. (#1172)
  • Fixed false-positive cross-layer block detection for MLA in NIXL. (#1205)
  • Fixed block size setting for Granite 4.0h. (#1318)
  • Fixed fallback bucket padding for Granite 4.0h. (#1207)
  • Fixed wrong AI Lab names in validated_models.md. (#1282)
  • Fixed the -u flag requiring an argument in calibrate_model.sh. (#1121)
  • Flattened 3D inputs_embeds in HpuModelAdapter.forward to fix a shape mismatch on upstream VL models (e.g. Qwen3-VL-MoE deepstack) when using 2D padded prefill batches. (#1380)

Security

  • Fixed Coverity issues, including security findings, null-like values, duplicates, and typos. (#1164)
  • Fixed SDL secure error handling issues. (#1245)

Deprecation & Breaking Changes

  • Upgraded to Intel® Gaudi® Software v1.24.0 and PyTorch 2.10, which requires users to update their Intel Gaudi software stack from v1.23.0 to v1.24.0.
  • Changed the default PT_HPU_LAZY_MODE in shipped Docker images to 0 (torch.compile) for both PyTorch upstream and fork builds. Lazy mode users must opt back in explicitly. (#1397)
  • Reverted the v0.19.0 backport of "Cap decode block bucket limit to reduce warmup time" (#1160) on the releases/v0.19.0 branch; long-context decode is now stabilized via #1376 and #1389 instead. (#1388)
  • Removed unused Unified Attention (UA) code. (#1226)
  • Removed the aggregate module HpuDeepseekOCRVisual. (#1102)
  • Removed deprecated virtual_engine from ForwardContext. (#1187)
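
Because the shipped Docker images now default to PT_HPU_LAZY_MODE=0, anyone who still depends on lazy mode has to set the variable before the server process starts. One way to do that from a Python launcher is sketched below. The helper name is hypothetical and the launch command in the trailing comment is only an example; PT_HPU_LAZY_MODE=1 selects lazy mode and 0 selects the torch.compile path, as described above.

```python
import os
from typing import Dict, Mapping, Optional


def hpu_env(lazy_mode: bool, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with PT_HPU_LAZY_MODE set explicitly.

    lazy_mode=True  -> PT_HPU_LAZY_MODE=1 (lazy mode, the previous behavior)
    lazy_mode=False -> PT_HPU_LAZY_MODE=0 (torch.compile, the new image default)
    """
    env = dict(os.environ if base is None else base)
    env["PT_HPU_LAZY_MODE"] = "1" if lazy_mode else "0"
    return env


# Build an environment that opts back in to lazy mode.
lazy_env = hpu_env(lazy_mode=True)
print(lazy_env["PT_HPU_LAZY_MODE"])

# Example launch (model name is a placeholder):
#     import subprocess
#     subprocess.run(["vllm", "serve", "<model>"], env=lazy_env, check=True)
```

When running a container directly, passing the variable with docker run -e PT_HPU_LAZY_MODE=1 has the same effect.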

Full Changelog

PR Title Author
#1402 To enable defrag if contig_pa is enabled @iboiko-habana
#1397 Cherry-pick to v0.19.0 Set Docker auto calc PT_HPU_LAZY_MODE=0 as default @nngokhale
#1394 HPUCompressedTensorsW8A8Int8_BF16Fallback impl @jbyczkow
#1389 Bucketing edge cases finetune for longer ctx (#1362) @ksmusz
#1388 [v0.19.0] Revert "Cap decode block bucket limit to reduce warmup time (#1160)" @adobrzyn
#1380 Flatten 3D inputs_embeds in HpuModelAdapter.forward @shepark
#1376 Stabilize decode block_list growth for long-context workloads (v0.19.0) @adobrzyn
#1374 Cherry-pick: Updated fix regression in Mistral-Large-3-675B (#1304) for v0.19.0 @skavulya
#1373 v0.19.0 interface fixes for hetero nixl connector @sandeep-maddipatla
#1368 Move mm graph warmup under pt compile only context @shepark
#1367 Cherry-pick from PR#1258 (single-process model swap) @12010486
#1334 Fix Synapse GC compile failure for FP8-quantized models @jiminha
#1330 Mamba prefix caching support for v0.19.0 @jbyczkow
#1323 Fix for proper KV cache slot addressing for Hybrid models @ksmusz
#1320 Separate conv1d for Granite 4.0 (v0.17.1-style) @jbyczkow
#1319 Optimizing visible block number for Hybrid kv_cache @ksmusz
#1318 Fix block size setting for granite 4.0h @jkaniecki
#1315 Granite-4.0-h Calibration config @jbyczkow
#1312 Changes for qwen35 @shepark
#1295 Optimize selective_state_update @jbyczkow
#1287 Resolving kv_cache access in HPUMambaMixer2 and reintroducing Granite4.0 in CI @ksmusz
#1286 Temporarily removing granite-4-h-small from CI @ksmusz
#1282 Fix wrong AI Lab names in validated_models.md @MaxAmende
#1279 Upstream vLLM compatibility fix @iboiko-habana
#1262 ci: fix EOF error when PR title contains apostrophe @adobrzyn
#1247 Fix of Qwen Out of HOST memory (OOM) @iboiko-habana
#1245 SDL secure error handling fixes @adobrzyn
#1235 qwen35 compact mode @libinta
#1229 Fix CPUOffloadingSpec import path and remove obsolete roberta patch @pawel-olejniczak
#1226 Remove dead Unified Attention (UA) code @adobrzyn
#1225 Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile @aung-san-i
#1222 Fix INC/MLA alias-path quantization failures @pawel-olejniczak
#1221 Granite-4.0-h Calibration config @mfylcek
#1220 Fix FP8 block-to-channel conversion breaking MLA weight loading @afierka-intel
#1216 Fix M-RoPE position tensor shape for batched multimodal prefill (BS>1) @afierka-intel
#1214 Fix param mismatch for compute_nixl_compatibility_hash() @hsubramony
#1212 Fix include all sub-packages in setuptools package discovery @Xaenalt
#1207 Granite4.0h fallback bucket padding fix @mfylcek
#1205 Fix false-positive cross-layer block detection for MLA in NIXL @iboiko-habana
#1200 Fix grammar bitmask corruption in mixed structured-output batches @jbyczkow
#1194 Port: Fix SharedFusedMoE attribute error for Llama4 MoE layers @adobrzyn
#1187 Remove deprecated virtual_engine from ForwardContext @iboiko-habana
#1183 Fix INC FP8 dynamic quantization for MoE models on HPU @yeonsily
#1181 Disable nixl CI tests @iboiko-habana
#1176 Monkey patch for LMCache @hlin99
#1174 Upstream vLLM hourly fix @tzielinski-habana
#1173 Update quickstart guide and supported model list @PatrykWo
#1172 Fix SharedFusedMoE attribute error for Llama4 MoE layers @adobrzyn
#1169 Add real context length to the high-level profile @yangulei
#1165 Reintroduce ci test for granite-4-h-small @microslaw
#1164 Coverity fix including security, null-like values, duplicates and typos @adobrzyn
#1160 Cap decode block bucket limit to reduce warmup time @adobrzyn
#1156 Load and Dequant MxFP4 Weights @SKRohit
#1153 qwen35 initial enablement @libinta
#1146 Fix KV cache memory regression from unconditional RowParallelLinear OOT registration @kamil-kaczor
#1141 Add num_spec field to MambaMixer2 for upstream compatibility @jbyczkow
#1140 Exclude dummy block from NIXL KV cache registration @yeonsily
#1136 PR-1054 revert @jczaja
#1135 Temporary nixl test cases disablement @iboiko-habana
#1131 Parameterize EXTRA_INDEX_URL @PatrykWo
#1129 Upstream vLLM compatibility fixes @iboiko-habana
#1126 Fix multimodal prefill batching for 2D padded inputs @afierka-intel
#1124 Fix OOM crashes during high-concurrency inference @afierka-intel
#1123 Add AI agents config files @kamil-kaczor
#1121 Fix -u flag requiring argument in calibrate_model.sh @afierka-intel
#1102 Remove aggregate module HpuDeepseekOCRVisual @jwieczorekhabana
#1100 Add more than 2 models to sleep mode model swapping test @12010486
#1093 Set reserved mem for Torch compile @nngokhale
#1092 Creating custom depthwise conv1d kernel for MambaMixer2 @ksmusz
#986 Adapt Online defragmenter for torch compile @jwieczorekhabana
#830 Fix preempted prompts and prefill/decoding splitting @yangulei

New Contributors

Welcome to the following first-time contributors to vLLM Gaudi Plugin!

  • @aung-san-i — Add VLLM_REPO and VLLM_GAUDI_REPO args to RHEL UBI Dockerfile (#1225)
  • @MaxAmende — Fix wrong AI Lab names in validated_models.md (#1282)
  • @Xaenalt — Fix setuptools package discovery to include sub-packages (#1212)
  • @12010486 — Add more than 2 models to sleep mode model swapping test (#1100)
  • @hlin99 — Monkey patch for LMCache (#1176)
  • @sandeep-maddipatla — v0.19.0 interface fixes for hetero nixl connector (#1373)