Release note#

v0.7.3.post1#

This is the first post-release of 0.7.3. Please follow the official doc to get started. This release includes the following changes:

Highlights#

  • Qwen3 and Qwen3MOE are now supported. The performance and accuracy of Qwen3 are well tested. MindIE Turbo is recommended to improve Qwen3 performance. #903 #915

  • Added a new performance guide to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and more. #878 Doc Link

Bug Fix#

  • Qwen2.5-VL works for RLHF scenarios now. #928

  • Users can now launch a model directly from online weights, e.g. from Hugging Face or ModelScope. #858 #918

  • The meaningless log message UserWorkspaceSize0 has been removed. #911

  • The log level for Failed to import vllm_ascend_C has been lowered from error to warning. #956

  • DeepSeek MLA now works with chunked prefill in the V1 engine. Please note that the V1 engine in 0.7.3 is experimental and intended for testing only. #849 #936

Docs#

  • The benchmark doc is updated for Qwen2.5 and Qwen2.5-VL. #792

  • Added a note clarifying that only “modelscope<1.23.0” works with 0.7.3. #954
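
The modelscope constraint above can be checked before launching. The following is a minimal sketch using only the Python standard library; the helper names are illustrative and not part of vllm-ascend, and the version parsing is deliberately simplified (it compares only the leading numeric release segment, so a pre-release such as 1.23.0rc1 is treated as 1.23.0 and rejected).

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Upper bound taken from the note above: only modelscope<1.23.0 works with 0.7.3.
MODELSCOPE_UPPER_BOUND = (1, 23, 0)


def release_tuple(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '1.22.3rc1' -> (1, 22, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    # Pad short versions such as '1.9' to three components.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def modelscope_is_compatible(version: str) -> bool:
    """True if the given modelscope version is below the 1.23.0 bound."""
    return release_tuple(version) < MODELSCOPE_UPPER_BOUND


def installed_modelscope_ok() -> Optional[bool]:
    """Check the installed modelscope; return None if it is not installed."""
    try:
        return modelscope_is_compatible(metadata.version("modelscope"))
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_modelscope_ok() in a startup script gives an early warning instead of a failure later during model download.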

v0.7.3#

🎉 Hello, World!

We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release are fully tested and verified. We encourage you to try it out and provide feedback. We’ll post bug fix versions in the future if needed. Please follow the official doc to start the journey.

Highlights#

  • This release includes all features landed in the previous release candidates (v0.7.1rc1, v0.7.3rc1, v0.7.3rc2). All of these features are fully tested and verified. Visit the official doc to get the detailed feature and model support matrix.

  • Upgrade CANN to 8.1.RC1 to enable the chunked prefill and automatic prefix caching features. You can enable them now.

  • Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users no longer need to install torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. #662

  • Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. #708
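
As a rough illustration of turning on the two features mentioned above, the sketch below assembles a `vllm serve` command line. It assumes the upstream vLLM flags `--enable-chunked-prefill` and `--enable-prefix-caching` apply unchanged on Ascend; the helper name and the model name are placeholders, not something these notes prescribe.

```python
import shlex
from typing import List


def build_serve_command(
    model: str,
    chunked_prefill: bool = True,
    prefix_caching: bool = True,
) -> List[str]:
    """Assemble a `vllm serve` command with the optional feature flags."""
    cmd = ["vllm", "serve", model]
    if chunked_prefill:
        # Upstream vLLM flag; assumed to behave the same with vllm-ascend.
        cmd.append("--enable-chunked-prefill")
    if prefix_caching:
        # Upstream vLLM flag for automatic prefix caching.
        cmd.append("--enable-prefix-caching")
    return cmd


if __name__ == "__main__":
    # Placeholder model name; substitute the model you actually serve.
    print(shlex.join(build_serve_command("Qwen/Qwen2.5-7B-Instruct")))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues.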

Core#

  • LoRA, Multi-LoRA, and dynamic serving are now supported. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks to China Merchants Bank for the contribution. #700

Model#

  • The performance of Qwen2-VL and Qwen2.5-VL is improved. #702

  • The performance of the apply_penalties and topKtopP ops is improved. #525

Other#

  • Fixed an issue that could cause a CPU memory leak. #691 #712

  • A new environment variable, SOC_VERSION, is added. If you hit any SoC detection error when building with custom ops enabled, please set SOC_VERSION to a suitable value. #606

  • An openEuler container image is now available with the v0.7.3-openeuler tag. #665

  • Prefix cache feature works on V1 engine now. #559
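
The SOC_VERSION setting above only matters when building with custom ops. A minimal sketch of preparing the build environment from Python follows. COMPILE_CUSTOM_KERNELS is the switch named in the v0.7.3rc2 notes; the helper name is hypothetical, and the SOC_VERSION value is left to the caller because these notes do not list valid values.

```python
import os
from typing import Dict, Mapping, Optional


def custom_ops_build_env(
    soc_version: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment prepared for a custom-ops build."""
    env = dict(os.environ if base_env is None else base_env)
    # Switch named in the v0.7.3rc2 notes for compiling custom ops.
    env["COMPILE_CUSTOM_KERNELS"] = "1"
    if soc_version:
        # Only set when automatic SoC detection fails; the value is user-supplied.
        env["SOC_VERSION"] = soc_version
    return env


# Example usage (assumption: run from a vllm-ascend source checkout):
#   import subprocess
#   env = custom_ops_build_env(soc_version="<your-soc-version>")
#   subprocess.run(["pip", "install", "-e", "."], env=env, check=True)
```

Returning a copy rather than mutating os.environ keeps the override scoped to the build subprocess.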

v0.7.3rc2#

This is the second release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

  • Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html

  • Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html

Highlights#

  • Add the Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, rotary_embedding, is added. More tutorials will come soon. Custom ops compilation is disabled by default when installing vllm-ascend. Set COMPILE_CUSTOM_KERNELS=1 to enable it. #371

  • The V1 engine has basic support in this release. Full support will come in the 0.8.X release. If you hit any issue with the V1 engine or have any requirement for it, please tell us here. #376

  • Prefix cache feature works now. You can set enable_prefix_caching=True to enable it. #282
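
For the offline Python API, the enable_prefix_caching=True setting mentioned above can be passed when constructing the engine. The sketch below assumes the upstream `vllm.LLM` constructor and `generate` method; the helper names are illustrative, and vllm is imported lazily so the argument builder can be used without it installed.

```python
from typing import Any, Dict, List


def prefix_caching_engine_kwargs(model: str) -> Dict[str, Any]:
    """Engine arguments that turn on prefix caching for the given model."""
    return {"model": model, "enable_prefix_caching": True}


def generate_with_shared_prefix(model: str, prefix: str, questions: List[str]) -> List[str]:
    """Run prompts that share a common prefix, so cached prefix blocks can be reused."""
    # Lazy import: requires vllm (and vllm-ascend on Ascend NPUs) to be installed.
    from vllm import LLM

    llm = LLM(**prefix_caching_engine_kwargs(model))
    prompts = [prefix + q for q in questions]
    outputs = llm.generate(prompts)
    # Each RequestOutput holds one or more completions; take the first.
    return [out.outputs[0].text for out in outputs]
```

Prefix caching pays off when many requests start with the same long text, such as a shared system prompt or document.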

Core#

  • Bump torch_npu version to dev20250320.3 to improve accuracy and fix the !!! output problem. #406

Model#

  • The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). #398

Other#

  • Fixed a bug so that the multi-step scheduler feature works. #349

  • Fixed a bug so that the prefix cache feature produces output with correct accuracy. #424

Known issues#

  • An error occurs with long prefix inputs when enable_prefix_caching=True is set, as the issue shows. The fix depends on the release of the CANN 8.1 NNAL package.

v0.7.3rc1#

🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.

  • Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html

  • Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html

Highlights#

  • DeepSeek V3/R1 works well now. Read the official guide to start! #242

  • Speculative decoding feature is supported. #252

  • Multi step scheduler feature is supported. #300

Core#

  • Bump torch_npu version to dev20250308.3 to improve _exponential accuracy

  • Added initial support for pooling models. BERT-based models, such as BAAI/bge-base-en-v1.5 and BAAI/bge-reranker-v2-m3, now work. #229

Model#

  • The performance of Qwen2-VL is improved. #241

  • MiniCPM is now supported #164

Other#

  • Support MTP (Multi-Token Prediction) for DeepSeek V3/R1. #236

  • [Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen 2.5VL. See the official doc for details.

  • Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807

Known issues#

  • In some cases, especially when the input/output is very long with a VL model, the output may be inaccurate. You may see many ! characters or other unreadable characters in the output. We are working on it, and it will be fixed in the next release.

  • Garbled model output has been reduced in this release. If you still hit the issue, try changing a generation config value, such as temperature, and try again. Any feedback is welcome. #277
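
The workaround in the known issues above (retry with a different generation config) can be sketched as a small retry loop. Everything here is an illustrative assumption: the garbled-output check is a simple heuristic that looks for a long run of ! characters, and `generate` stands for any caller-supplied function that maps a temperature to generated text.

```python
from typing import Callable, Iterable, Optional


def looks_garbled(text: str, min_run: int = 8) -> bool:
    """Heuristic: True if the text contains min_run or more consecutive '!'."""
    return "!" * min_run in text


def generate_with_retry(
    generate: Callable[[float], str],
    temperatures: Iterable[float] = (0.0, 0.7, 1.0),
) -> Optional[str]:
    """Try each temperature in turn and return the first non-garbled output."""
    for temperature in temperatures:
        text = generate(temperature)
        if not looks_garbled(text):
            return text
    # Every attempt looked garbled; let the caller decide what to do.
    return None
```

The threshold of 8 is arbitrary; tune it so that ordinary punctuation such as "Hi!!!" is not flagged.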

v0.7.1rc1#

🎉 Hello, World!

We are excited to announce the first release candidate of v0.7.1 for vllm-ascend.

vLLM Ascend Plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.

Please follow the official doc to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions here.

Highlights#

  • Initial support for Ascend NPU on vLLM. #3

  • DeepSeek is now supported. #88 #68

  • Qwen, Llama series and other popular models are also supported. You can see more details here.

Core#

  • Added the Ascend quantization config option. The implementation is coming soon. #7 #73

  • Add silu_and_mul and rope ops and add mix ops into attention layer. #18

Other#

  • [CI] Enable Ascend CI to actively monitor and improve quality for vLLM on Ascend. #3

  • [Docker] Add vllm-ascend container image #64

  • [Docs] Add a live doc #55

Known issues#

  • This release relies on an unreleased torch_npu version. It is already installed in the official container image. Please install it manually if you are using a non-container environment.

  • Logs like No platform detected, vLLM is running on UnspecifiedPlatform or Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") are shown when running vllm-ascend. They do not affect functionality or performance, and you can ignore them. This has been fixed in this PR, which will be included in v0.7.3 soon.

  • Logs like # CPU blocks: 35064, # CPU blocks: 2730 are shown when running vllm-ascend; they should read # NPU blocks: . This does not affect functionality or performance, and you can ignore it. This has been fixed in this PR, which will be included in v0.7.3 soon.