Release note#
v0.10.0rc1 - 2025.08.07#
This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the official doc to get started. V0 is completely removed from this version.
Highlights#
Core#
Ascend PyTorch adapter (torch_npu) has been upgraded to 2.7.1.dev20250724. #1562 CANN has been upgraded to 8.2.RC1. #1653 Don’t forget to update them in your environment, or use the latest images.
vLLM Ascend works on Atlas 800I A3 now, and the A3 image will be released from this version on. #1582
Kimi-K2 with w8a8 quantization, Qwen3-Coder and GLM-4.5 are supported in vLLM Ascend. Please follow this tutorial to try them. #2162
Pipeline Parallelism is supported in V1 now. #1800
Prefix cache feature now works with the Ascend Scheduler. #1446
Torchair graph mode works with tp > 4 now. #1508
MTP supports torchair graph mode now. #2145
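As a rough aid for the upgrade reminder above, the following sketch checks whether the installed torch_npu build matches the version named in this release. The distribution name `torch-npu` and the prefix-match rule are assumptions made for illustration, not something vLLM Ascend itself ships.

```python
from importlib import metadata
from typing import Optional

# Version string quoted in the release note above.
EXPECTED_TORCH_NPU = "2.7.1.dev20250724"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_expected(installed: Optional[str], expected: str) -> bool:
    """True when the installed version string starts with the expected one.

    A prefix match (an assumption of this sketch) tolerates local build
    suffixes appended after the expected version.
    """
    return installed is not None and installed.startswith(expected)


if __name__ == "__main__":
    found = installed_version("torch-npu")  # assumed distribution name
    status = "ok" if matches_expected(found, EXPECTED_TORCH_NPU) else "needs update"
    print(f"torch-npu: {found} ({status})")
```

The CANN toolkit is not a Python package, so its version has to be checked separately.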
Other#
Bug fixes:
Performance has been improved through a number of PRs:
Cache sin/cos instead of calculating them in every layer. #1890
Improve shared expert multi-stream parallelism #1891
Implement the fusion of allreduce and matmul in prefill phase when tp is enabled. Enable this feature by setting VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to 1. #1926
Optimize Quantized MoE Performance by Reducing All2All Communication. #2195
Use AddRmsNormQuant ops in the custom model to optimize Qwen3’s performance #1806
Use multicast to avoid padding decode request to prefill size #1555
The performance of LoRA has been improved. #1884
A batch of refactoring PRs to enhance the code architecture:
Parameter changes:
expert_tensor_parallel_size in additional_config is removed now, and EP and TP are now aligned with vLLM. #1681
Add VLLM_ASCEND_MLA_PA to the environment variables. Use it to enable the MLA paged attention operator for DeepSeek MLA decode.
Add VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE to the environment variables. It enables the MatmulAllReduce fusion kernel when tensor parallel is enabled. This feature is supported on A2, and eager mode gets better performance.
Add VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ to the environment variables. It controls whether to enable MoE all2all seq, which provides a basic framework on top of alltoall for easy expansion.
UT coverage reached 76.34% after a batch of PRs following this RFC: #1298
Sequence Parallelism works for Qwen3 MoE. #2209
Chinese online document is added now. #1870
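The environment variables listed under the parameter changes have to be set before vLLM starts. A minimal sketch, assuming they are read from the process environment at startup: the variable names come from this release note, the value "1" for VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE is stated above, and "1" for the other two is an assumption of this example. The helper `apply_flags` is hypothetical.

```python
import os
from typing import Dict, MutableMapping, Optional

# Names are taken from the release note. Only the MATMUL_ALLREDUCE value is
# stated there; "1" for the other two is an assumption of this sketch.
ASCEND_FLAGS: Dict[str, str] = {
    "VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE": "1",
    "VLLM_ASCEND_MLA_PA": "1",
    "VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ": "1",
}


def apply_flags(
    flags: Dict[str, str],
    environ: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Set each flag unless a value is already present, then return the effective values."""
    target = os.environ if environ is None else environ
    for name, value in flags.items():
        # setdefault keeps whatever the user exported explicitly.
        target.setdefault(name, value)
    return {name: target[name] for name in flags}
```

Calling `apply_flags(ASCEND_FLAGS)` before importing vLLM leaves any value the user already exported untouched, which keeps shell-level overrides working.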
Known Issues#
Aclgraph does not work with DP + EP currently. The main gap is that the number of NPU streams Aclgraph needs to capture the graph is insufficient. #2229
There is an accuracy issue on W8A8 dynamic quantized DeepSeek with multistream enabled. This will be fixed in the next release. #2232
In Qwen3 MoE, SP cannot be incorporated into the Aclgraph. #2246
MTP does not support the V1 scheduler currently. This will be fixed in Q3. #2254
When running MTP with DP > 1, the metrics logger needs to be disabled due to an issue in vLLM. #2254
v0.9.1rc2 - 2025.08.04#
This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Highlights#
MoE and dense w4a8 quantization are supported now: #1320 #1910 #1275 #1480
Dynamic EPLB support in #1943
Disaggregated Prefilling support for V1 Engine, with continued development and stabilization of the feature, including performance enhancements and bug fixes for single-machine setups: #1953 #1612 #1361 #1746 #1552 #1801 #2083 #1989
Models improvement:#
DeepSeek DBO support and improvement: #1285 #1291 #1328 #1420 #1445 #1589 #1759 #1827 #2093
DeepSeek MTP improvement and bugfix: #1214 #943 #1584 #1473 #1294 #1632 #1694 #1840 #2076 #1990 #2019
Qwen3 MoE support improvement and bugfix around graph mode and DP: #1940 #2006 #1832
Qwen3 performance improvement around rmsnorm/rope/mlp ops: #1545 #1719 #1726 #1782 #1745
DeepSeek MLA chunked prefill/graph mode/multistream improvement and bugfix: #1240 #933 #1135 #1311 #1750 #1872 #2170 #1551
Qwen2.5 VL improvement via mrope/padding mechanism improvement: #1261 #1705 #1929 #2007
Ray: Fix the device error when using ray and add initialize_cache and improve warning info: #1234 #1501
Graph mode improvement:#
Fix DeepSeek with mc2 in #1269
Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions in #1332
Fix torchair_graph_batch_sizes bug in #1570
Enable the limit of tp <= 4 for torchair graph mode in #1404
Fix rope accuracy bug #1887
Support multistream of shared experts in FusedMoE #997
Enable kvcache_nz for the decode process in torchair graph mode #1098
Fix chunked-prefill with torchair case to resolve UnboundLocalError: local variable ‘decode_hs_or_q_c’ issue in #1378
Improve shared experts multi-stream perf for w8a8 dynamic in #1561
Repair MoE error when multistream is set in #1882
Round up graph batch size to tp size in EP case #1610
Fix torchair bug when DP is enabled in #1727
Add extra checking to torchair_graph_config in #1675
Fix rope bug in torchair+chunk-prefill scenario in #1693
torchair_graph bugfix when chunked_prefill is true in #1748
Improve prefill optimization to support torchair graph mode in #2090
Fix rank set in DP scenario #1247
Reset all unused positions to prevent out-of-bounds to resolve GatherV3 bug in #1397
Remove duplicate multimodal codes in ModelRunner in #1393
Fix block table shape to resolve accuracy issue in #1297
Implement primal full graph with limited scenario in #1503
Restore paged attention kernel in Full Graph for performance in #1677
Fix DeepSeek OOM issue in extreme --gpu-memory-utilization scenario in #1829
Turn off aclgraph when enabling TorchAir in #2154
Ops improvement:#
Core:#
Upgrade CANN to 8.2.rc1 in #2036
Upgrade torch-npu to 2.5.1.post1 in #2135
Upgrade python to 3.11 in #2136
Disable quantization in mindie_turbo in #1749
Fix v0 spec decode in #1323
Enable ACL_OP_INIT_MODE=1 directly only when using V0 spec decode in #1271
Refactoring forward_context and model_runner_v1 in #1422
Fix sampling params in #1423
Add a switch for enabling NZ layout in weights and enable NZ for GMM in #1409
Address PrefillCacheHit state to fix prefix cache accuracy bug in #1492
Fix load weight error and add new e2e case in #1651
Optimize the number of rope-related index selections in deepseek in #1614
add mc2 mask in #1642
Fix static EPLB log2phy condition and improve unit test in #1667 #1896 #2003
add chunk mc2 for prefill in #1703
Fix mc2 op GroupCoordinator bug in #1711
Fix the failure to recognize the actual type of quantization in #1721
Fix deepseek bug when tp_size == 1 in #1755
Added support for delay-free blocks in prefill nodes in #1691
Moe alltoallv communication optimization for unquantized RL training & alltoallv support dpo in #1547
Adapt dispatchV2 interface in #1822
Fix disaggregate prefill hang issue in long output in #1807
Fix flashcomm_v1 when engine v0 in #1859
ep_group is not equal to world_size in some cases in #1862
Fix wheel glibc version incompatibility in #1808
Fix mc2 process group to resolve self.cpu_group is None in #1831
Pin vllm version to v0.9.1 to make mypy check passed in #1904
Apply npu_moe_gating_top_k_softmax for moe to improve perf in #1902
Fix bug in path_decorator when engine v0 in #1919
Avoid performing cpu all_reduce in disaggregated-prefill scenario in #1644
add super kernel in decode moe in #1916
[Prefill Perf] Parallel Strategy Optimizations (VRAM-for-Speed Tradeoff) in #1802
Remove unnecessary reduce_results access in shared_experts.down_proj in #2016
Optimize greedy reject sampler with vectorization in #2002
Make multiple Ps and Ds work on a single machine in #1936
Fixes the shape conflicts between shared & routed experts for deepseek model when tp > 1 and multistream_moe enabled in #2075
Add cpu binding support #2031
Add with_prefill cpu allreduce to handle D-node recomputation in #2129
Add D2H & initRoutingQuantV2 to improve prefill perf in #2038
Docs:#
Provide an e2e guide for execute duration profiling #1113
Add Referer header for CANN package download url. #1192
Add reinstall instructions doc #1370
Update Disaggregate prefill README #1379
Disaggregate prefill for kv cache register style #1296
Fix errors and non-standard parts in examples/disaggregate_prefill_v1/README.md in #1965
Known Issues#
v0.9.2rc1 - 2025.07.11#
This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the official doc to get started. From this release, the V1 engine is enabled by default, so there is no need to set VLLM_USE_V1=1 any more. This release is also the last version to support the V0 engine; V0 code will be cleaned up in the future.
Highlights#
Core#
Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250619. Don’t forget to update it in your environment. #1347
The GatherV3 error has been fixed with aclgraph mode. #1416
W8A8 quantization works on Atlas 300I series now. #1560
Fixed the accuracy problem when deploying models with parallel parameters. #1678
The pre-built wheel package now requires a lower version of glibc. Users can install it directly with pip install vllm-ascend. #1582
Other#
Official doc has been updated for a better reading experience. For example, more deployment tutorials are added, and user/developer docs are updated. More guides are coming soon.
Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. #1331
A new env variable VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP has been added. It enables the fused allgather-experts kernel for DeepSeek V3/R1 models. The default value is 0. #1335
A new env variable VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION has been added to improve the performance of topk-topp sampling. The default value is 0; we’ll consider enabling it by default in the future. #1732
A batch of bugs have been fixed for the Data Parallelism case. #1273 #1322 #1275 #1478
The DeepSeek performance has been improved. #1194 #1395 #1380
Ascend scheduler works with prefix cache now. #1446
DeepSeek works with prefix cache now. #1498
Support prompt logprobs to recover ceval accuracy in V1 #1483
Known Issues#
Pipeline parallel does not work with ray and graph mode: https://github.com/vllm-project/vllm-ascend/issues/1751 https://github.com/vllm-project/vllm-ascend/issues/1754
New Contributors#
@xleoken made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1357
@lyj-jjj made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1335
@sharonyunyun made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1194
@Pr0Wh1teGivee made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1308
@leo-pony made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1374
@zeshengzong made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1452
@GDzhu01 made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1477
@Agonixiaoxiao made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1531
@zhanghw0354 made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1476
@farawayboat made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1591
@ZhengWG made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1196
@wm901115nwpu made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1654
Full Changelog: https://github.com/vllm-project/vllm-ascend/compare/v0.9.1rc1...v0.9.2rc1
v0.9.1rc1 - 2025.06.22#
This is the 1st release candidate of v0.9.1 for vLLM Ascend. Please follow the official doc to get started.
Experimental#
Atlas 300I series is experimentally supported in this release (functional tests passed with Qwen2.5-7b-instruct/Qwen2.5-0.5b/Qwen3-0.6B/Qwen3-4B/Qwen3-8B). #1333
Support EAGLE-3 for speculative decoding. #1032
After careful consideration, the above features will NOT be included in the v0.9.1-dev branch (v0.9.1 final release), taking into account the v0.9.1 release quality and the rapid iteration of these features. We will improve this from 0.9.2rc1 onward.
Core#
Ascend PyTorch adapter (torch_npu) has been upgraded to 2.5.1.post1.dev20250528. Don’t forget to update it in your environment. #1235
Support Atlas 300I series container image. You can get it from quay.io
Fix token-wise padding mechanism to make multi-card graph mode work. #1300
Upgrade vLLM to 0.9.1. #1165
Other Improvements#
Initial support Chunked Prefill for MLA. #1172
An example of best practices to run DeepSeek with ETP has been added. #1101
Performance improvements for DeepSeek using the TorchAir graph. #1098, #1131
Supports the speculative decoding feature with AscendScheduler. #943
Improve VocabParallelEmbedding custom op performance. It will be enabled in the next release. #796
Fixed a device discovery and setup bug when running vLLM Ascend on Ray. #884
DeepSeek with MC2 (Merged Compute and Communication) now works properly. #1268
Fixed log2phy NoneType bug with static EPLB feature. #1186
Improved performance for DeepSeek with DBO enabled. #997, #1135
Refactoring AscendFusedMoE #1229
Add initial user stories page (include LLaMA-Factory/TRL/verl/MindIE Turbo/GPUStack) #1224
Add unit test framework #1201
Known Issues#
In some cases, the vLLM process may crash with a GatherV3 error when aclgraph is enabled. We are working on this issue and will fix it in the next release. #1038
Prefix cache feature does not work with the Ascend Scheduler when chunked prefill is not enabled. This will be fixed in the next release. #1350
Full Changelog#
https://github.com/vllm-project/vllm-ascend/compare/v0.9.0rc2...v0.9.1rc1
New Contributors#
@farawayboat made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1333
@yzim made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1159
@chenwaner made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1098
@wangyanhui-cmss made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1184
@songshanhu07 made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1186
@yuancaoyaoHW made their first contribution in https://github.com/vllm-project/vllm-ascend/pull/1032
v0.9.0rc2 - 2025.06.10#
This release contains some quick fixes for v0.9.0rc1. Please use this release instead of v0.9.0rc1.
Highlights#
Fix the import error when vllm-ascend is installed without editable way. #1152
v0.9.0rc1 - 2025.06.09#
This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the official doc to start the journey. From this release, the V1 Engine is recommended. The code of the V0 Engine is frozen and will not be maintained any more. Please set the environment variable VLLM_USE_V1=1 to enable the V1 Engine.
Highlights#
DeepSeek works with graph mode now. Follow the official doc to give it a try. #789
Qwen series models work with graph mode now. It works by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We’ll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting enforce_eager=True when initializing the model.
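The eager-mode fallback described above can be sketched as follows. `enforce_eager` is an argument of `vllm.LLM`; the helper names `build_engine_kwargs` and `create_llm`, and the model ID used in the usage note, are placeholders introduced only for this example.

```python
import os
from typing import Any, Dict


def build_engine_kwargs(model: str, use_graph_mode: bool = True) -> Dict[str, Any]:
    """Return keyword arguments for vllm.LLM.

    enforce_eager=True disables graph mode and runs the model eagerly.
    """
    return {"model": model, "enforce_eager": not use_graph_mode}


def create_llm(model: str, use_graph_mode: bool = True):
    """Create an LLM on the V1 Engine, optionally falling back to eager mode."""
    # The note asks for VLLM_USE_V1=1 in this release; set it before importing vllm.
    os.environ.setdefault("VLLM_USE_V1", "1")
    from vllm import LLM  # imported lazily so the helper above works without vllm installed

    return LLM(**build_engine_kwargs(model, use_graph_mode))
```

For example, `create_llm("Qwen/Qwen2.5-7B-Instruct", use_graph_mode=False)` would start the model in eager mode if graph mode misbehaves.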
Core#
The performance of multi-step scheduler has been improved. Thanks for the contribution from China Merchants Bank. #814
LoRA, Multi-LoRA and Dynamic Serving are supported for V1 Engine now. Thanks for the contribution from China Merchants Bank. #893
Prefix cache and chunked prefill features work now. #782 #844
Spec decode and MTP features work with V1 Engine now. #874 #890
DP feature works with DeepSeek now. #1012
Input embedding feature works with V0 Engine now. #916
Sleep mode feature works with V1 Engine now. #1084
Model#
Other#
Online serving with Ascend quantization works now. #877
A batch of bugs for graph mode and moe model have been fixed. #773 #771 #774 #816 #817 #819 #912 #897 #961 #958 #913 #905
A batch of performance improvement PRs have been merged. #784 #803 #966 #839 #970 #947 #987 #1085
From this release, binary wheel package will be released as well. #775
The contributor doc site is added
Known Issue#
In some cases, the vLLM process may crash with aclgraph enabled. We’re working on this issue and it’ll be fixed in the next release.
Multi node data-parallel doesn’t work with this release. This is a known issue in vllm and has been fixed on main branch. #18981
v0.7.3.post1 - 2025.05.29#
This is the first post release of 0.7.3. Please follow the official doc to start the journey. It includes the following changes:
Highlights#
Qwen3 and Qwen3MOE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. MindIE Turbo is recommended to improve the performance of Qwen3. #903 #915
Added a new performance guide. The guide aims to help users to improve vllm-ascend performance on system level. It includes OS configuration, library optimization, deploy guide and so on. #878 Doc Link
Bug Fix#
Qwen2.5-VL works for RLHF scenarios now. #928
Users can launch the model from online weights now, e.g. directly from huggingface or modelscope. #858 #918
The meaningless log info UserWorkspaceSize0 has been cleaned. #911
The log level for Failed to import vllm_ascend_C has been changed to warning instead of error. #956
DeepSeek MLA now works with chunked prefill in V1 Engine. Please note that the V1 engine in 0.7.3 is experimental and only for test usage. #849 #936
Docs#
v0.7.3 - 2025.05.08#
🎉 Hello, World!
We are excited to announce the release of 0.7.3 for vllm-ascend. This is the first official release. The functionality, performance, and stability of this release are fully tested and verified. We encourage you to try it out and provide feedback. We’ll post bug fix versions in the future if needed. Please follow the official doc to start the journey.
Highlights#
This release includes all features landed in the previous release candidates (v0.7.1rc1, v0.7.3rc1, v0.7.3rc2). All the features are fully tested and verified. Visit the official doc to get the detailed feature and model support matrix.
Upgrade CANN to 8.1.RC1 to enable chunked prefill and automatic prefix caching features. You can enable them now.
Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users don’t need to install torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. #662
Integrate MindIE Turbo into vLLM Ascend to improve DeepSeek V3/R1, Qwen 2 series performance. #708
Core#
LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #700
Model#
Other#
v0.8.5rc1 - 2025.05.06#
This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the official doc to start the journey. Now you can enable the V1 engine by setting the environment variable VLLM_USE_V1=1. See the feature support status of vLLM Ascend here.
Highlights#
Upgrade CANN version to 8.1.RC1 to support chunked prefill and automatic prefix caching (--enable_prefix_caching) when V1 is enabled. #747
Optimize Qwen2 VL and Qwen 2.5 VL. #701
Improve DeepSeek V3 eager mode and graph mode performance. You can now use --additional_config={'enable_graph_mode': True} to enable graph mode. #598 #719
Core#
Upgrade vLLM to 0.8.5.post1 #715
Fix early return in CustomDeepseekV2MoE.forward during profile_run #682
Adapts for new quant model generated by modelslim #719
Initial support on P2P Disaggregated Prefill based on llm_datadist #694
Use /vllm-workspace as code path and include .git in container image to fix the issue when starting vllm under /workspace. #726
Optimize NPU memory usage to make DeepSeek R1 W8A8 32K model len work. #728
Fix PYTHON_INCLUDE_PATH typo in setup.py. #762
Other#
v0.8.4rc2 - 2025.04.29#
This is the second release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. Some experimental features are included in this version, such as W8A8 quantization and EP/DP support. We’ll make them stable enough in the next release.
Highlights#
Qwen3 and Qwen3MOE are supported now. Please follow the official doc to run the quick demo. #709
Ascend W8A8 quantization method is supported now. See the official doc for an example. Any feedback is welcome. #580
DeepSeek V3/R1 works with DP, TP and MTP now. Please note that it’s still in experimental status. Let us know if you hit any problem. #429 #585 #626 #636 #671
Core#
ACLGraph feature is supported with V1 engine now. It’s disabled by default because this feature relies on the CANN 8.1 release. We’ll make it available by default in the next release. #426
Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu, so users don’t need to install torch-npu by hand. The 2.5.1 version of torch-npu will be installed automatically. #661
Other#
MiniCPM model works now. #645
openEuler container image is supported with the v0.8.4-openeuler tag, and the custom ops build is enabled by default for openEuler OS. #689
Fix ModuleNotFoundError bug to make LoRA work. #600
Add “Using EvalScope evaluation” doc #611
Add a VLLM_VERSION environment variable to make the vLLM version configurable, which helps developers set the correct vLLM version if the vLLM code has been changed locally by hand. #651
v0.8.4rc1 - 2025.04.18#
This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the official doc to start the journey. From this version, vllm-ascend will follow the newest version of vllm and release every two weeks. For example, if vllm releases v0.8.5 in the next two weeks, vllm-ascend will release v0.8.5rc1 instead of v0.8.4rc2. Please find the detail from the official documentation.
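The tagging rule in the example above can be written down as a small function. This is only an illustration of the rule as stated in this note; `next_rc_tag` is a hypothetical helper, not part of any release tooling, and it assumes plain `X.Y.Z` upstream versions.

```python
import re

# Matches release-candidate tags such as "v0.8.4rc1".
_RC_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)rc(\d+)$")


def next_rc_tag(current_tag: str, latest_vllm: str) -> str:
    """Return the next vllm-ascend release-candidate tag.

    If upstream vLLM has moved past the current base version, start a new
    rc1 on the upstream version; otherwise bump the rc number.
    """
    match = _RC_TAG.match(current_tag)
    if match is None:
        raise ValueError(f"not a release-candidate tag: {current_tag!r}")
    *base, rc = (int(part) for part in match.groups())
    upstream = [int(part) for part in latest_vllm.lstrip("v").split(".")]
    if upstream > base:  # lists compare element by element
        return "v" + ".".join(str(part) for part in upstream) + "rc1"
    return "v" + ".".join(str(part) for part in base) + f"rc{rc + 1}"
```

With the versions from the note, `next_rc_tag("v0.8.4rc1", "0.8.5")` gives `v0.8.5rc1`, while an unchanged upstream version would give `v0.8.4rc2`.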
Highlights#
vLLM V1 engine experimental support is included in this version. You can visit the official guide for more detail. By default, vLLM falls back to V0 if V1 doesn’t work. Set the VLLM_USE_V1=1 environment variable if you want to force V1.
LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. #521
Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. #513
Core#
The Ascend scheduler is added for the V1 engine. This scheduler has better affinity with Ascend hardware. More scheduler policies will be added in the future. #543
Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by vllm team. vllm-ascend will support it once it’s ready from vLLM. Follow the official guide to use. #432
Spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. #500
Structured output feature works now on V1 Engine. Currently it only supports xgrammar backend while using guidance backend may get some errors. #555
Other#
A new communicator pyhccl is added. It’s used to call the CANN HCCL library directly instead of using torch.distributed. More usage of it will be added in the next release. #503
The custom ops build is enabled by default. You should install packages like gcc and cmake first to build vllm-ascend from source. Set the COMPILE_CUSTOM_KERNELS=0 environment variable to disable the compilation if you don’t need it. #466
The custom op rotary embedding is enabled by default now to improve the performance. #555
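When building from source, the custom ops switch above is passed through the environment. The sketch below prepares such an environment and checks for the build tools the note mentions; `missing_build_tools` and `build_env` are hypothetical helpers, and the exact install command is left as a comment because it depends on your checkout.

```python
import os
import shutil
from typing import Dict, List, Optional, Sequence


def missing_build_tools(tools: Sequence[str] = ("gcc", "cmake")) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_env(
    compile_custom_kernels: bool,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Copy the environment and set COMPILE_CUSTOM_KERNELS to "1" or "0"."""
    env = dict(os.environ if base is None else base)
    env["COMPILE_CUSTOM_KERNELS"] = "1" if compile_custom_kernels else "0"
    return env


# Possible use from a vllm-ascend source checkout (not executed here):
#   subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."],
#                  env=build_env(compile_custom_kernels=False), check=True)
```

Checking `missing_build_tools()` first gives a clearer error than a failed compilation when gcc or cmake is absent.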
v0.7.3rc2 - 2025.03.29#
This is the 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.
Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
Highlights#
Add Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, rotary_embedding, is added. More tutorials will come soon. The Custom Ops compilation is disabled by default when installing vllm-ascend. Set COMPILE_CUSTOM_KERNELS=1 to enable it. #371
V1 engine has basic support in this release. Full support will be done in the 0.8.X release. If you hit any issue or have any requirements for the V1 engine, please tell us here. #376
Prefix cache feature works now. You can set enable_prefix_caching=True to enable it. #282
Core#
Bump torch_npu version to dev20250320.3 to improve accuracy and fix the !!! output problem. #406
Model#
The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). #398
Other#
v0.7.3rc1 - 2025.03.14#
🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.
Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
Highlights#
DeepSeek V3/R1 works well now. Read the official guide to start! #242
Speculative decoding feature is supported. #252
Multi step scheduler feature is supported. #300
Core#
Bump torch_npu version to dev20250308.3 to improve _exponential accuracy.
Added initial support for pooling models. BERT-based models, such as BAAI/bge-base-en-v1.5 and BAAI/bge-reranker-v2-m3, work now. #229
Model#
Other#
Support MTP(Multi-Token Prediction) for DeepSeek V3/R1 #236
[Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen 2.5VL. See the official doc for details.
Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807
Known issues#
In some cases, especially when the input/output is very long, the accuracy of output may be incorrect. We are working on it. It’ll be fixed in the next release.
Improved and reduced the garbled code in model output. If you still hit the issue, try changing a generation config value, such as temperature, and try again. There is also a known issue shown below. Any feedback is welcome. #277
v0.7.1rc1 - 2025.02.19#
🎉 Hello, World!
We are excited to announce the first release candidate of v0.7.1 for vllm-ascend.
vLLM Ascend Plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.
Please follow the official doc to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions here
Highlights#
Core#
Other#
Known issues#
This release relies on an unreleased torch_npu version. It is already installed in the official container image. Please install it manually if you are using a non-container environment.
There are logs like No platform detected, vLLM is running on UnspecifiedPlatform or Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") shown when running vllm-ascend. They don’t affect any functionality or performance, so you can ignore them. This has been fixed in this PR, which will be included in v0.7.3 soon.
There are logs like # CPU blocks: 35064, # CPU blocks: 2730 shown when running vllm-ascend, which should be # NPU blocks:. They don’t affect any functionality or performance, so you can ignore them. This has been fixed in this PR, which will be included in v0.7.3 soon.