vLLM Gaudi Plugin v0.16.0 Release Notes

Overview

This release is based on vLLM v0.16.0 and supports Intel® Gaudi® Software v1.23.0.

Highlights

Added validated support for the following models: Qwen3-VL, DeepSeek OCR, MiniMax-M2, Ovis, Mistral-Large-3, and Hunyuan V1.
Improved performance through backported bug fixes, Mamba improvements, and faster model weight loading.
Enhanced quantization by forcing CPU loading for INC quantization, which prevents OOM during weight loading.
Improved UBI/RHEL Docker images and server defaults, and applied Coverity fixes.

New Model Support and Updates

Change Qwen3-VL to use HPUMMEncoderAttention (#1060)
Enable caching for Qwen3 MoE op (#1068)
Fix Qwen3-VL MoE execution failure (#1028)
Enable DeepSeek OCR model (#954)
Add dotsocr and seedoss (#977)
Add MiniMax-M2 support (#964)
Add Ovis model support with default buckets (#846)
Enable Mistral-Large-3-675B-Instruct-2512 model (#871)
Add Hunyuan V1 model support (Dense & MoE bf16/FP8) (#875)

Performance

[GAUDISW-246429] hpu_mamba_chunk_scan_combined_varlen improvements (#1074)
Improve model weight loading speed (#807)
Fix warmup regression (#962)

Attention and KV Cache

Transpose state in conv1d instead of changing the KV cache shape (#1065)
[GAUDISW-245713] Remove bucket densification for long ctx; Edge buckets only for long ctx (#915)
Temporarily disable chunked attention (#981)
Multimodal model embedding fixes (#759)
[CT] Add FP8 GQA Support (#874)
[CT] Fix CT Config to honor fp8_inc KV cache dtype (#929)

Quantization

Force CPU loading for INC quantization to prevent OOM during weight loading (#1055)
Fix INC patching _gate twice (#955)
[GAUDISW-246337] Add config with scale method maxabs_pcs_pow2 for dynamic quantization (#949)
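
As a rough illustration of what a max-abs, per-channel, power-of-two scale could look like, the sketch below computes one scale per output channel and rounds it up to a power of two. Reading `maxabs_pcs_pow2` as "max-abs, per-channel scaling, power of two" is an assumption, as are the helper name `maxabs_pcs_pow2_scales` and the FP8 maximum of 448.0; this is not the INC implementation.

```python
import math

# Assumed largest finite FP8 (E4M3) value; the value used on a given device
# may differ.
FP8_MAX = 448.0


def maxabs_pcs_pow2_scales(weight, fp8_max=FP8_MAX):
    """Return one power-of-two scale per row (output channel) of `weight`.

    Hypothetical helper: `weight` is a list of rows of floats.
    """
    scales = []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        if max_abs == 0.0:
            scales.append(1.0)  # all-zero channel: any scale works
            continue
        # Smallest power of two s such that max_abs / s <= fp8_max.
        exponent = math.ceil(math.log2(max_abs / fp8_max))
        scales.append(2.0 ** exponent)
    return scales


weight = [
    [0.5, -3.0, 1.25],
    [1000.0, -20.0, 4.0],
]
scales = maxabs_pcs_pow2_scales(weight)
```

Dividing each row by its scale keeps every value within the assumed FP8 range, and a power-of-two scale only shifts the exponent of each value.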

Plugin Core

Source use_qk_norm parameter directly from config (#1084)
Fix last_chunk_indices calculations (#1023)
Fix mamba cumsum padded calculations (#1021)
Fix redundant transpose in HPUMambaMixer2 (#1015)
Fix HPUMambaMixer2 inheritance dependency (#1016)
Add _MAMBA_PAD_BLOCK_ID (#951)
Enable OffloadingConnector on HPU (#827)
GPT OSS Integration Code (#887)
Fix async scheduler + unified attention failure on Qwen2.5-VL (#931)
Fix undefined behavior in copy_blocks when source and destination blocks overlap (#329)
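
The overlap hazard addressed in `copy_blocks` can be illustrated generically: when block copies are applied one after another in place, a copy may read a block that an earlier copy has already overwritten. The sketch below is not the plugin's code. The function names and the list-based cache are hypothetical, and it only contrasts an in-place sequential copy with one that reads all sources before writing.

```python
def copy_blocks_in_place(cache, src_ids, dst_ids):
    """Naive sequential copy: later reads may see already-overwritten blocks."""
    for src, dst in zip(src_ids, dst_ids):
        cache[dst] = cache[src]


def copy_blocks_snapshot(cache, src_ids, dst_ids):
    """Read every source block first, then write, so overlap is harmless."""
    staged = [cache[src] for src in src_ids]
    for block, dst in zip(staged, dst_ids):
        cache[dst] = block


# Block 1 is both a destination (of block 0) and a source (for block 2).
src_ids, dst_ids = [0, 1], [1, 2]

naive = ["a", "b", "c"]
copy_blocks_in_place(naive, src_ids, dst_ids)

safe = ["a", "b", "c"]
copy_blocks_snapshot(safe, src_ids, dst_ids)
```

In this toy version the naive result is deterministic but wrong (block 2 receives the new content of block 1 rather than its original content), while the snapshot variant copies the original contents regardless of overlap.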

Serving and Infrastructure

Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile (#1056)
Improve Docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) (#1041)
Add libfdt-devel to UBI Dockerfile (#974)
Fix device detection when ENABLE_CONSOLE=true (#963)

Fixes

Don't destroy server with logprobs (#1098)
Coverity fix including security, null-like values, duplicates and typos (#1094)
Fix param mismatch for compute_nixl_compatibility_hash() (#1087)
Fix Topk Calculation in GPTOSS (#970)
Fix reported version of vLLM (#811)
Fix _compile_region for nested attributes (#956)
Fix sampler & TP>1 recompilations (#935)
Restore default temperature=0 for the server after #32723 (#1037)
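
Because server-side sampling defaults can change between releases, a client that needs greedy output may prefer to set `temperature` explicitly in each request. A minimal sketch of such a request body for the OpenAI-compatible completions endpoint, assuming a hypothetical model name and prompt:

```python
import json

# Hypothetical model name and prompt, used only for illustration.
payload = {
    "model": "my-model",
    "prompt": "Hello, Gaudi!",
    "max_tokens": 16,
    "temperature": 0.0,  # explicit greedy sampling, independent of server defaults
}

# Serialized request body, ready to be sent as JSON.
body = json.dumps(payload)
```

The resulting `body` can be posted as JSON to the server's `/v1/completions` route.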

Full Changelog

PR	Title	Author
#1098	Don't destroy server with logprobs	@adobrzyn
#1094	Coverity fix including security, null-like values, duplicates and typos	@adobrzyn
#1087	fix param mismatch for compute_nixl_compatibility_hash()	@hsubramony
#1060	Change Qwen3VL to use HPUMMEncoderAttention	@jiminha
#1068	Enable caching for qwen3 moe op	@shepark
#1084	use_qk_norm parameter sourced directly from config	@rsmyrek
#1056	Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile	@PatrykWo
#1037	Back temperature=0 for server as default after #32723	@iboiko-habana
#1089	Change upstream last_good_commit 89a77b10846fd96273cce78d86d2556ea582d26e	@iboiko-habana
#1041	Improve docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0)	@nngokhale
#1080	Port of #1050 for CI unblocking	@iboiko-habana
#1074	hpu_mamba_chunk_scan_combined_varlen improvements	@PatrykWilczewski
#1057	Add ci test for granite-4-h-small to v0.16.0	@microslaw
#1065	Instead of changing kv cache shape, transpose state in conv1d	@jmamzax
#1023	Fix last_chunk_indices calculations	@jbyczkow
#1021	Fix mamba cumsum padded calculations	@jkaniecki
#999	Fix redundant transpose in HPUMambaMixer2 (#1015)	@ksmusz
#1019	Fixes for #33559 and #34103	@iboiko-habana
#1055	Force CPU loading for INC quantization to prevent OOM during weight loading	@agrabow
#1016	Fix HPUMambaMixer2 inheritance dependency	@jbyczkow
#1028	Fix qwen3 vl moe execution failure	@shepark
#1042	Adding ci_calibration_smoke_tests.sh into v0.16.0	@iboiko-habana
#971	UBI images improvements	@ghandoura
#954	Enable deepseek ocr model	@HeJunyan
#977	Add dotsocr and seedoss	@tianyuan211
#975	Monkey-patch of Attention.forward	@tzielinski-habana
#824	Adjust pre-merge workflow to support merge queue trigger event	@bmyrcha
#970	Fix Topk Calculation in GPTOSS	@SKRohit
#981	Temporarily disable chunked attention	@adobrzyn
#982	adding FIX_FOR_VLLM_CUSTOM to CI	@iboiko-habana
#974	Add libfdt-devel (new habanalabs-thunk dependency) to ubi dockerfile	@mmuszynskihabana
#930	Fix for individual unit tests	@tzielinski-habana
#969	CI cleanup 2	@microslaw
#964	Add MiniMax-M2 support	@testdig
#846	Add ovis models support with default buckets	@testdig
#713	Create UBI based vLLM docker build instructions	@ghandoura
#811	Fix reported version of vllm	@ghandoura
#960	Add docker image cleanup at the end of workflows	@bmyrcha
#962	Fix warmup regression	@kamil-kaczor
#965	Add hf_transfer to test requirements	@bmyrcha
#955	Fix INC patching _gate twice	@kamil-kaczor
#933	CI cleanup	@microslaw
#963	Fix device detection when ENABLE_CONSOLE=true	@afierka-intel
#956	Fixing _compile_region for nested attributes	@ksmusz
#871	Enable Mistral-Large-3-675B-Instruct-2512 model	@skavulya
#915	Remove bucket densification for long ctx; Edge buckets only for long ctx	@kfojcik-intel
#723	Dryrun implementation for generating command line file	@rajanintel24
#759	Multimodal model embedding fixes	@libinta
#329	Fix undefined behavior in copy_blocks when source and destination blocks overlap	@yafshar
#949	Added config with scale method: maxabs_pcs_pow2 for dynamic quant	@HolyFalafel
#951	Add _MAMBA_PAD_BLOCK_ID	@jbyczkow
#875	Add Hunyuan V1 model support (Dense & MoE bf16/FP8)	@jjmiao1
#887	GPT OSS Integration Code	@hlahkar
#916	Port: Initialization profiling noop (#932)	@michalkuligowski
#941	Port profile run off #916	@adobrzyn
#931	Fix async scheduler + unified attention failure on Qwen2.5-VL	@tvoas
#662	Add local path option for hf_cache	@PatrykWo
#940	Missing updates for Llama4 on main	@Luca-Calabria
#902	Add unit tests for multimodal inputs classes	@microslaw
#827	Enable OffloadingConnector on HPU.	@yeonsily
#788	Set device according to local rank	@yangulei
#889	Adapt OnlineDefragmenter and CacheSwapUtils for torc…	@jwieczorekhabana
#923	Modify ubi docker to support both internal and external builds	@mmuszynskihabana
#944	New testowners	@adobrzyn
#893	Fix torch.compile crash in sampler by removing NumPy dependency in tensor padding	@tvoas
#874	[CT] Add FP8 GQA Support	@yiliu30
#807	Improve model weight loading speed	@yupengzh-intel
#935	Fix sampler & TP>1 recompilations	@kamil-kaczor
#929	[CT] Fix CT Config to honor `fp8_inc` KV cache dtype	@yiliu30

New Contributors

Welcome to the first-time contributors to the vllm-gaudi plugin!

@agrabowski #1055 'Force CPU loading for INC quantization to prevent OOM during weight loading'