vLLM Gaudi Plugin v0.17.1 Release Notes¶

Overview¶

This release is based on vLLM v0.17.1 and supports Intel® Gaudi® Software v1.23.0 and Intel® Gaudi® Software v1.24.0.

Highlights¶

Added validated support for Ernie4.5-VL, GPT-OSS (20B/120B), and reranking models (Bert-based, Roberta-based, and Qwen3-based).
Introduced MxFP4 weight loading and dequantization support for Gaudi, enabling GPT-OSS model inference.
Introduced major Mamba/Granite 4.0-h improvements, including prefix caching support, custom depthwise conv1d TPC kernels, and precision enhancements.
Enhanced RowParallel NIC chunking for better distributed inference performance.
Added logprobs output functionality and Granite tool calling accuracy improvements.
Improved stability by fixing grammar bitmask corruption.

New Model Support¶

Added support for Ernie4.5-VL. (#813)
Ported the following reranking models: Bert-based, Roberta-based, and Qwen3-based. (#1001)
Added support for GPT-OSS models (lmsys/gpt-oss-20b-bf16, lmsys/gpt-oss-120b-bf16, openai/gpt-oss-20b, and openai/gpt-oss-120b) via MxFP4 weight dequantization. (#1251)
Enabled caching for Qwen3 MoE op. (#1249)

Performance¶

Optimized the selective_state_update reference in MambaMixer2 decode. (#1244)
Replaced fancy indexing with select and copy for Granite4 state updates. (#1210)
Created a custom depthwise conv1d kernel for MambaMixer2. (#1175)
Improved bf16 precision of _depthwise_conv1d_tpc. (#1203)
Improved hpu_mamba_chunk_scan_combined_varlen. (#997)
Improved RowParallel NIC chunking. (#896)
Added compute_logits to _compile_methods. (#1081)
Blocked B2BMatmul in dynamic quantization. (#1002)

Attention & KV Cache¶

Transposed state in conv1d instead of changing the KV cache shape. (#1025)
Fixed a KV cache memory regression caused by unconditional RowParallelLinear OOT registration. (#1215)
Added prefix caching support for HPUMambaMixer2. (#1198)

Quantization¶

Loaded and dequantized MxFP4 weights. (#1251)
Added a Granite-4.0-h calibration config. (#1221)
Forced CPU loading for INC quantization to prevent OOM during weight loading. (#1006)
Fixed a type mismatch in DeepSeek with fp8_fused_sdpa for MLA prefill. (#978)

Plugin Core¶

Added the num_spec field to MambaMixer2 for upstream compatibility. (#1142)
Fixed a SharedFusedMoE attribute error for Llama4 MoE layers. (#1172)
Removed a redundant transpose in HPUMambaMixer2. (#999)
Fixed an HPUMambaMixer2 inheritance dependency. (#1017)
Fixed Mamba cumsum padded calculations. (#1009)
Sourced the use_qk_norm parameter directly from the config. (#972)
Replaced MM dummy options. (#1085)
Added the logprobs functionality. (#1101)
Improved Granite tool-calling accuracy. (#1018)
Added the torch inference decorator back to warmup. (#1104)
Added a mechanism for adding events to tlparse. (#1054)
Fixed the gemma3 UT by replacing a tuple operation with a TC-friendly equivalent. (#1083)

Serving & Infrastructure¶

Set docker autocalc rules for reserved memory in Torch compile mode. (#1170)
Improved the docker autocalc linear recipe for long contexts. (#959)
Fixed the Dockerfile for RHEL 9.6 builds by updating the package installation order. (#1008)
Installed torchaudio from the CPU wheel to match the PyTorch version in the Dockerfile. (#1110)
Moved the inline Dockerfile to a separate file and added torchaudio. (#1050)
Installed torchaudio in CD Dockerfiles. (#1051)
Added the PT_VERSION argument and installed torchaudio in the Dockerfile. (#1043)
Removed pt_fork and a duplicated package from the UBI image. (#1066)
Restored the server default temperature=0 after #32723. (#1039)
Fixed setuptools package discovery to include sub-packages. (#1219)
Fixed the -u flag requiring an argument in calibrate_model.sh. (#1167)

Fixes¶

Fixed OOM crashes during high-concurrency inference. (#1252)
Fixed Qwen Out of Host Memory (OOM) errors. (#1256)
Fixed grammar bitmask corruption in mixed structured-output batches. (#1199)
Fixed Granite4.0h fallback bucket padding. (#1207)
Fixed a prefill bucket mismatch when prefills with no context were padded. (#1064)
Fixed default max decode blocks in exponential. (#1091)
Fixed an import error for MultiModalBudget. (#1062)
Fixed Qwen3-VL warmup. (#994)
Prevented server crashes when requests were canceled. (#990)
Fixed a parameter mismatch for compute_nixl_compatibility_hash(). (#1224)

Security¶

Fixed SDL secure error handling issues. (#1246)
Fixed coverity issues, including security, null-like values, duplicates, and typos. (#1163)

Full Changelog¶

PR	Title	Author
#813	Add Support for Ernie4.5-VL	@jinyouzhi
#896	RowParallel NIC chunking	@kamil-kaczor
#921	Add fp8 calibration tests to CI	@afierka-intel
#959	Improve docker autocalc linear recipe for long contexts	@nngokhale
#967	Add ci test for granite-4-h-small	@microslaw
#972	use_qk_norm parameter sourced directly from config	@rsmyrek
#978	Fix type mismatch in DeepSeek with fp8_fused_sdpa for mla prefill	@skavulya
#990	Server doesn't crash when request is canceled	@tzielinski-habana
#994	Qwen3-VL WarmUp Fix	@slokesha
#995	Add sleep mode model swapping test for Gaudi	@PatrykWo
#997	hpu_mamba_chunk_scan_combined_varlen improvements	@PatrykWilczewski
#998	Hourly fixes – batch no. 2	@pawel-olejniczak
#999	Fixing redundant transpose in HPUMambaMixer2	@ksmusz
#1001	Ported the reranking models: Bert-based, Roberta-based and Qwen3-based	@gyou2021
#1002	Blocking B2BMatmul in dynamic quantization	@HolyFalafel
#1006	Force CPU loading for INC quantization to prevent OOM during weight loading	@agrabow
#1008	Fix Dockerfile for RHEL 9.6 build by updating package installation order	@PatrykWo
#1009	Fix mamba cumsum padded calculations	@jkaniecki
#1017	Fix HPUMambaMixer2 inheritance dependency	@jbyczkow
#1018	Granite accuracy for tool calling	@adobrzyn
#1025	Instead of changing kv cache shape, transpose state in conv1d	@jmamzax
#1031	Change Qwen3VL to use HPUMMEncoderAttention	@jiminha
#1039	Back temperature=0 for server as default after #32723	@iboiko-habana
#1043	Add PT_VERSION argument and install torchaudio in Dockerfile	@PatrykWo
#1050	Moved inline Dockerfile to a separate file and added torchaudio	@tzielinski-habana
#1051	Install torchaudio in CD Dockerfiles	@tzielinski-habana
#1053	Hourly fixes – batch no. 3	@pawel-olejniczak
#1054	Added mechanism for adding events to tlparse	@jczaja
#1062	Fix import error for MultiModalBudget	@tvoas
#1064	Fix prefill bucket mismatch when prefills with no context are padded	@mfylcek
#1066	UBI image: remove pt_fork and duplicated package	@ghandoura
#1067	Add workflow to update VLLM_COMMUNITY_COMMIT via GitHub Actions	@PatrykWo
#1081	Adding compute_logits to _compile_methods	@ksmusz
#1083	Fix to gemma3 UT — replaced tuple operation by TC friendly equivalent	@jczaja
#1085	Replace mm dummy options	@skaulintel
#1090	Fix for MoE refactor #32344	@iboiko-habana
#1091	Fix for default max decode blocks in exponential	@adobrzyn
#1101	Logprobs functionality	@adobrzyn
#1104	Add torch inference decorator back to warmup	@skaulintel
#1108	Hourly fixes, part 3	@iboiko-habana
#1110	Install torchaudio from CPU wheel to match PyTorch version in Dockerfile	@PatrykWo
#1114	Hourly fixes, part 4	@iboiko-habana
#1115	Fix for vLLM #35503	@iboiko-habana
#1116	Fix for vLLM #35503	@iboiko-habana
#1125	Cherry from 0.16.0 release	@PatrykWo
#1142	Add num_spec field to MambaMixer2 for upstream compatibility	@jbyczkow
#1163	Coverity fix including security, null-like values, duplicates and typos	@adobrzyn
#1167	Fix -u flag requiring argument in calibrate_model.sh	@adobrzyn
#1170	Set docker auto calc rules for reserved memory in Torch compile mode	@nngokhale
#1172	Fix SharedFusedMoE attribute error for Llama4 MoE layers	@adobrzyn
#1175	Creating custom depthwise conv1d kernel for MambaMixer2	@ksmusz
#1178	Update quickstart guide and supported model list	@PatrykWo
#1198	Prefix caching support for HPUMambaMixer2	@jbyczkow
#1199	Fix grammar bitmask corruption in mixed structured-output batches	@jbyczkow
#1203	Improving precision of _depthwise_conv1d_tpc for bf16	@ksmusz
#1207	Granite4.0h fallback bucket padding fix	@mfylcek
#1210	Replacing fancy indexing with select and copy for Granite4 state update	@ksmusz
#1215	Fix KV cache memory regression from unconditional RowParallelLinear OOT registration	@kamil-kaczor
#1219	Fix setuptools package discovery to include sub-packages	@app/copilot-swe-agent
#1221	Granite-4.0-h Calibration config	@mfylcek
#1224	Fix param mismatch for compute_nixl_compatibility_hash	@hsubramony
#1228	Add ci test for granite-4-h-small to 0.17.1	@microslaw
#1244	Optimization of selective_state_update ref in MambaMixer2 decode	@ksmusz
#1246	SDL secure error handling fixes	@adobrzyn
#1249	Enable caching for qwen3 moe op	@shepark
#1251	Load and Dequant MxFP4 Weights	@SKRohit
#1252	Fix OOM crashes during high-concurrency inference	@afierka-intel
#1256	Fix of Qwen Out of HOST memory (OOM)	@iboiko-habana

New Contributors¶

Welcome to the following first-time contributors to vLLM Gaudi Plugin!

@gyou2021 — Ported reranking models: Bert-based, Roberta-based and Qwen3-based (#1001)
@jczaja — Added mechanism for adding events to tlparse (#1054)
@jinyouzhi — Add Support for Ernie4.5-VL (#813)
@mfylcek — Granite4.0h fallback bucket padding fix (#1207)
@pawel-olejniczak — Hourly upstream compatibility fixes (#998)
@skaulintel — Replace mm dummy options and warmup improvements (#1085)