Skip to content

vLLM Gaudi Plugin v0.16.0 Release Notes

Overview

This release is based on vLLM v0.16.0 and supports Intel® Gaudi® Software v1.23.0.

Highlights

  • Added validated support for the following models: Qwen3-VL, DeepSeek OCR, MiniMax-M2, Ovis, Mistral-Large-3, and Hunyuan V1.
  • Improved performance by introducing backported bug fixes, mamba improvements, and model weight loading speeds.
  • Enhanced quantization to force CPU loading for INC quantization to prevent OOM.
  • Introduced various improvements to UBI/RHEL Docker images, server defaults, and Coverity fixes.

New Model Support and Updates

  • Change Qwen3-VL to use HPUMMEncoderAttention (#1060)
  • Enable caching for Qwen3 MoE op (#1068)
  • Fix Qwen3-VL MoE execution failure (#1028)
  • Enable DeepSeek OCR model (#954)
  • Add dotsocr and seedoss (#977)
  • Add MiniMax-M2 support (#964)
  • Add Ovis model support with default buckets (#846)
  • Enable Mistral-Large-3-675B-Instruct-2512 model (#871)
  • Add Hunyuan V1 model support (Dense & MoE bf16/FP8) (#875)

Performance

  • [GAUDISW-246429] hpu_mamba_chunk_scan_combined_varlen improvements (#1074)
  • Improve model weight loading speed (#807)
  • Fix warmup regression (#962)

Attention and KV Cache

  • Instead of changing KV cache shape, transpose state in conv1d (#1065)
  • [GAUDISW-245713] Remove bucket densification for long ctx; Edge buckets only for long ctx (#915)
  • Temporarily disable chunked attention (#981)
  • Multimodal model embedding fixes (#759)
  • [CT] Add FP8 GQA Support (#874)
  • [CT] Fix CT Config to honor fp8_inc KV cache dtype (#929)

Quantization

  • Force CPU loading for INC quantization to prevent OOM during weight loading (#1055)
  • Fix INC patching _gate twice (#955)
  • [GAUDISW-246337] Added config with scale method: maxabs_pcs_pow2 for dynamic quant (#949)

Plugin Core

  • Source use_qk_norm parameter directly from config (#1084)
  • Fix last_chunk_indices calculations (#1023)
  • Fix mamba cumsum padded calculations (#1021)
  • Fix redundant transpose in HPUMambaMixer2 (#1015)
  • Fix HPUMambaMixer2 inheritance dependency (#1016)
  • Add _MAMBA_PAD_BLOCK_ID (#951)
  • Enable OffloadingConnector on HPU. (#827)
  • GPT OSS Integration Code (#887)
  • Fix async scheduler + unified attention failure on Qwen2.5-VL (#931)
  • Fix undefined behavior in copy_blocks when source and destination blocks overlap (#329)

Serving and Infrastructure

  • Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile (#1056)
  • Improve Docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) (#1041)
  • Add libfdt-devel to UBI Dockerfile (#974)
  • Fix device detection when ENABLE_CONSOLE=true (#963)

Fixes

  • Don't destroy server with logprobs (#1098)
  • Coverity fix including security, null-like values, duplicates and typos (#1094)
  • Fix param mismatch for compute_nixl_compatibility_hash() (#1087)
  • Fix Topk Calculation in GPTOSS (#970)
  • Fix reported version of vLLM (#811)
  • Fixing _compile_region for nested attributes (#956)
  • Fix sampler & TP>1 recompilations (#935)
  • Restore default temperature=0 for the server after #32723 (#1037)

Full Changelog

PR Title Author
#1098 Don't destroy server with logprobs @adobrzyn
#1094 Coverity fix including security, null-like values, duplicates and typos @adobrzyn
#1087 fix param mismatch for compute_nixl_compatibility_hash() @hsubramony
#1060 Change Qwen3VL to use HPUMMEncoderAttention @jiminha
#1068 Enable caching for qwen3 moe op @shepark
#1084 use_qk_norm parameter sourced directly from config @rsmyrek
#1056 Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile @PatrykWo
#1037 Back temperature=0 for server as default after #32723 @iboiko-habana
#1089 Change upstream last_good_commit 89a77b10846fd96273cce78d86d2556ea582d26e @iboiko-habana
#1041 Improve docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) @nngokhale
#1080 Port of #1050 for CI unblocking @iboiko-habana
#1074 hpu_mamba_chunk_scan_combined_varlen improvements @PatrykWilczewski
#1057 Add ci test for granite-4-h-small to v0.16.0 @microslaw
#1065 Instead of changing kv cache shape, transpose state in conv1d @jmamzax
#1023 Fix last_chunk_indices calculations @jbyczkow
#1021 Fix mamba cumsum padded calculations @jkaniecki
#999 Fix redundant transpose in HPUMambaMixer2 (#1015) @ksmusz
#1019 Fixes for #33559 and #34103 @iboiko-habana
#1055 Force CPU loading for INC quantization to prevent OOM during weight loading @agrabow
#1016 Fix HPUMambaMixer2 inheritance dependency @jbyczkow
#1028 Fix qwen3 vl moe execution failure @shepark
#1042 Adding ci_calibration_smoke_tests.sh into v0.16.0 @iboiko-habana
#971 UBI images improvements @ghandoura
#954 Enable deepseek ocr model @HeJunyan
#977 Add dotsocr and seedoss @tianyuan211
#975 Monkey-patch of Attention.forward @tzielinski-habana
#824 Adjust pre-merge workflow to support merge queue trigger event @bmyrcha
#970 Fix Topk Calculation in GPTOSS @SKRohit
#981 Temporarily disable chunked attention @adobrzyn
#982 adding FIX_FOR_VLLM_CUSTOM to CI @iboiko-habana
#974 Add libfdt-devel (new habanalabs-thunk dependency) to ubi dockerfile @mmuszynskihabana
#930 Fix for individual unit tests @tzielinski-habana
#969 CI cleanup 2 @microslaw
#964 Add MiniMax-M2 support @testdig
#846 Add ovis models support with default buckets @testdig
#713 Create UBI based vLLM docker build instructions @ghandoura
#811 Fix reported version of vllm @ghandoura
#960 Add docker image cleanup at the end of workflows @bmyrcha
#962 Fix warmup regression @kamil-kaczor
#965 Add hf_transfer to test requirements @bmyrcha
#955 Fix INC patching _gate twice @kamil-kaczor
#933 CI cleanup @microslaw
#963 Fix device detection when ENABLE_CONSOLE=true @afierka-intel
#956 Fixing _compile_region for nested attributes @ksmusz
#871 Enable Mistral-Large-3-675B-Instruct-2512 model @skavulya
#915 Remove bucket densification for long ctx; Edge buckets only for long ctx @kfojcik-intel
#723 Dryrun implementation for generating command line file @rajanintel24
#759 Multimodal model embedding fixes @libinta
#329 Fix undefined behavior in copy_blocks when source and destination blocks overlap @yafshar
#949 Added config with scale method: maxabs_pcs_pow2 for dynamic quant @HolyFalafel
#951 Add _MAMBA_PAD_BLOCK_ID @jbyczkow
#875 Add Hunyuan V1 model support (Dense & MoE bf16/FP8) @jjmiao1
#887 GPT OSS Integration Code @hlahkar
#916 Port: Initialization profiling noop (#932) @michalkuligowski
#941 Port profile run off #916 @adobrzyn
#931 Fix async scheduler + unified attention failure on Qwen2.5-VL @tvoas
#662 Add local path option for hf_cache @PatrykWo
#940 Missing updates for Llama4 on main @Luca-Calabria
#902 Add unit tests for multimodal inputs classes @microslaw
#827 Enable OffloadingConnector on HPU. @yeonsily
#788 Set device according to local rank @yangulei
#889 Adapt OnlineDefragmenter and CacheSwapUtils for torc… @jwieczorekhabana
#923 Modify ubi docker to support both internal and external builds @mmuszynskihabana
#944 New testowners @adobrzyn
#893 Fix torch.compile crash in sampler by removing NumPy dependency in tensor padding @tvoas
#874 [CT] Add FP8 GQA Support @yiliu30
#807 Improve model weight loading speed @yupengzh-intel
#935 Fix sampler & TP>1 recompilations @kamil-kaczor
#929 [CT] Fix CT Config to honor fp8_inc KV cache dtype @yiliu30

New Contributors

Welcome to the first-time contributors to the vllm-gaudi plugin!

  • @agrabowski #1055 'Force CPU loading for INC quantization to prevent OOM during weight loading'