Additional Configuration
Additional configuration is a mechanism provided by vLLM that lets plugins control their own internal behavior. vLLM Ascend uses this mechanism to make the project more flexible.
Migration Guide
Starting from PR #9064, vLLM Ascend is migrating 10 environment variables to --additional-config.
Important Notice
Current Support: Both environment variables and --additional-config are supported during the transition period.
Recommendation: Use --additional-config for new deployments and migrate existing configurations.
Future Plan: Environment variables will be removed in a future release; only --additional-config will be supported.
Quick Reference
| Environment Variable | Config Key | Type Conversion |
|---|---|---|
| | | |
| | | |
| | | |
| | | Integer (unchanged) |
| | | |
| | | |
| | | Integer (unchanged, field name changed) |
| | | |
| | | Integer (unchanged) |
| | | |
Example Migration
Before (environment variable):
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve Qwen/Qwen3-8B
After (additional-config):
vllm serve Qwen/Qwen3-8B --additional-config='{"enable_flashcomm1": true}'
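The migration step above can be sketched as a small script. This is a minimal sketch, not part of vLLM Ascend: the helper names `env_to_additional_config` and `build_serve_command` are hypothetical, only the FlashComm1 mapping is taken from the example above, and treating `"1"` as enabled (anything else as disabled) is an assumption.

```python
import json
import shlex

# Only this mapping comes from the migration example above.
# Other entries would have to be filled in from the Quick Reference table.
ENV_TO_KEY = {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "enable_flashcomm1"}


def env_to_additional_config(env):
    """Translate known environment variables into additional-config keys."""
    config = {}
    for name, key in ENV_TO_KEY.items():
        if name in env:
            # Assumption: "1" means enabled, any other value means disabled.
            config[key] = env[name] == "1"
    return config


def build_serve_command(model, config):
    """Build the online-mode command line with a shell-quoted JSON payload."""
    payload = shlex.quote(json.dumps(config))
    return f"vllm serve {shlex.quote(model)} --additional-config={payload}"


if __name__ == "__main__":
    cfg = env_to_additional_config({"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
    print(build_serve_command("Qwen/Qwen3-8B", cfg))
```

Serializing with `json.dumps` keeps the payload valid JSON (lowercase `true`), and `shlex.quote` wraps it in single quotes so the shell passes it through unchanged.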
How to use
Additional configuration can be used in both online mode and offline mode. Take Qwen3 as an example:
Online mode:
vllm serve Qwen/Qwen3-8B --additional-config='{"config_key":"config_value"}'
Offline mode:
from vllm import LLM
LLM(model="Qwen/Qwen3-8B", additional_config={"config_key":"config_value"})
Configuration options
The following table lists additional configuration options available in vLLM Ascend:
| Name | Type | Default | Description |
|---|---|---|---|
| | dict | | Configuration options for Xlite graph mode |
| | dict | | Configuration options for weight prefetch |
| | dict | | Configuration options for module tensor parallelism |
| | dict | | Configuration options for Ascend compilation |
| | dict | | Configuration options for EPLB |
| | bool | | Whether to refresh global Ascend configuration content. This is usually used by RLHF or UT/E2E test cases. |
| | dict | | Inline msprobe dump configuration. vLLM Ascend will materialize it to a temporary JSON file and pass that file to the debugger. |
| | str | | Configuration file path for msprobe dump (legacy option kept for compatibility). |
| | bool | | Whether to enable asynchronous exponential overlap. To enable asynchronous exponential, set this config to True. |
| | bool | | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
| | bool | | Whether to enable multi-stream shared expert. This option only takes effect on MoE models with shared experts. |
| | bool | | Whether to enable multi-stream overlap gate. This option only takes effect on MoE models with shared experts. |
| | bool | | Whether to enable the recompute scheduler. Only valid in PD-disaggregated mode ( |
| | bool | | Enables Ascend-native CPU binding on ARM servers. Set to |
| | int | | SLO limits for dynamic batch. This is a new scheduler that supports the dynamic batch feature. |
| | bool | | Whether to enable npugraph_ex graph mode. |
| | list | | The custom shape list of page attention ops. |
| | bool | | Whether to enable KV cache NZ layout. This option only takes effect on models using MLA (e.g., DeepSeek). |
| | dict | | Configuration options for Layer Sharding Linear. Layer Sharding can only be enabled on the P node in PD-disaggregated mode. |
| | bool | | Whether to enable KV cache C8 in DSA models (e.g., DeepSeekV3.2 and GLM5). Not currently supported on A5 devices. |
| | bool | | Whether to enable inter-node communication over RoCE for dispatch/combine ops. |
| | dict | | Configuration options for dynamic chunked pipeline parallel. See Dynamic Chunked Pipeline Parallel for details. |
| | bool | | Whether to enable balance scheduling. Can also be configured via |
| | bool | | Whether to enable FlashComm1 optimization. Can also be configured via |
| | bool | | Whether to enable matmul allreduce optimization. Can also be configured via |
| | int | | FlashComm2 parallel size. Can also be configured via |
| | bool | | Whether to use daemon mode for msmonitor. Can also be configured via |
| | bool | | Whether to enable MLAPO (Model Layer-wise Adaptive Parallel Optimization). Can also be configured via |
| | int | | Weight NZ mode. Can also be configured via |
| | bool | | Whether to enable context parallelism. Can also be configured via |
| | int | | Fused MC2 configuration. Can also be configured via |
| | bool | | Whether to enable transpose KV cache by block. Can also be configured via |
| | bool | | Whether to enable dsa_cp for DeepSeek V3.2, DeepSeek V4, and other models with the same architecture. This feature depends on FlashComm1; ensure that FlashComm1 is enabled before enabling this feature. |
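Because some options depend on others (dsa_cp requires FlashComm1), a pre-flight check can catch a misconfiguration before the engine starts. The following is a minimal sketch under stated assumptions: `missing_dependencies` is a hypothetical helper, and `enable_dsa_cp` is a hypothetical placeholder, not a confirmed key name. Only `enable_flashcomm1` appears in the migration example.

```python
def missing_dependencies(config, dependencies):
    """Return (feature, required) pairs where the feature is on but its prerequisite is off."""
    problems = []
    for feature, required in dependencies.items():
        if config.get(feature, False) and not config.get(required, False):
            problems.append((feature, required))
    return problems


# "enable_dsa_cp" is a placeholder key used for illustration only.
DEPENDENCIES = {"enable_dsa_cp": "enable_flashcomm1"}

if __name__ == "__main__":
    candidate = {"enable_dsa_cp": True}
    for feature, required in missing_dependencies(candidate, DEPENDENCIES):
        print(f"{feature} requires {required} to be enabled")
```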
The details of each configuration option are as follows:
xlite_graph_config
| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable Xlite graph mode. Currently only Llama, Qwen dense series models, and Qwen3-VL are supported. |
| | bool | | Whether to enable Xlite for both the prefill and decode stages. By default, Xlite is only enabled for the decode stage. |
weight_prefetch_config
| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable weight prefetch. |
| | dict | | Prefetch ratio of each weight. |
finegrained_tp_config
| Name | Type | Default | Description |
|---|---|---|---|
| | int | | The custom tensor parallel size of lm_head. |
| | int | | The custom tensor parallel size of o_proj. |
| | int | | The custom tensor parallel size of embedding. |
| | int | | The custom tensor parallel size of mlp. |
ascend_compilation_config
| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable npugraph_ex backend. |
| | bool | | Whether to enable static kernel. Suitable for scenarios where shape changes are minimal and some time is available for static kernel compilation. |
| | bool | | Whether to enable fuse_norm_quant pass. |
| | bool | | Whether to enable fuse_qknorm_rope pass. If Triton is not in the environment, set it to False. |
| | bool | | Whether to enable fuse_allreduce_rms pass. It’s set to False because of conflict with SP. |
| | bool | | Whether to enable fuse_muls_add pass. |
eplb_config
| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable dynamic EPLB. |
| | str | | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
| | int | | Forward iterations when EPLB begins. |
| | int | | The forward iterations when the EPLB worker will finish CPU tasks. |
| | str | | Save the expert load calculation results to a new expert table in the specified directory. |
| | int | | Specify redundant experts during initialization. |
profiling_chunk_config
| Name | Type | Default | Description |
|---|---|---|---|
| | bool | | Whether to enable dynamic chunked pipeline parallel. Requires |
| | float | | Smoothing factor (0 < x ≤ 1.0). Higher values trust the dynamic prediction more; |
| | int | | Minimum chunk size for dynamic calculation. Should be smaller than |
| | bool | True | Whether to enable online calibration. |
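To make the smoothing factor and minimum chunk size concrete, here is one plausible reading as a sketch. It assumes the smoothing factor acts as an exponential-blend weight between the previous chunk size and the newly predicted one, with the result clamped to the minimum. The function `next_chunk_size` and its parameter names are hypothetical, and the actual vLLM Ascend implementation may differ.

```python
def next_chunk_size(previous, predicted, smooth_factor, min_chunk):
    """Blend the previous and predicted chunk sizes, then clamp to a minimum.

    Illustrative only: an exponential blend is an assumption about how a
    smoothing factor in (0, 1.0] could be applied.
    """
    if not 0.0 < smooth_factor <= 1.0:
        raise ValueError("smooth_factor must satisfy 0 < x <= 1.0")
    # A higher smooth_factor puts more weight on the dynamic prediction.
    blended = smooth_factor * predicted + (1.0 - smooth_factor) * previous
    return max(min_chunk, int(round(blended)))


if __name__ == "__main__":
    print(next_chunk_size(previous=4096, predicted=8192, smooth_factor=0.5, min_chunk=1024))
```

With a factor of 1.0 the prediction is used directly; smaller factors move the chunk size more gradually toward it.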
Example
An example of additional configuration is as follows:
{
    "weight_prefetch_config": {
        "enabled": True,
        "prefetch_ratio": {
            "attn": {
                "qkv": 1.0,
                "o": 1.0,
            },
            "moe": {
                "gate_up": 0.8
            },
            "mlp": {
                "gate_up": 1.0,
                "down": 1.0
            }
        },
    },
    "finegrained_tp_config": {
        "lmhead_tensor_parallel_size": 8,
        "oproj_tensor_parallel_size": 8,
        "embedding_tensor_parallel_size": 8,
        "mlp_tensor_parallel_size": 8,
    },
    "enable_kv_nz": False,
    "multistream_overlap_shared_expert": True,
    "refresh": False
}
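The example above is written with Python literals (`True`, `False`, trailing commas), so it can be passed directly as `additional_config` in offline mode but is not valid JSON for the command line. A minimal sketch of converting it for online mode follows; the variable names are illustrative, and only the standard-library `json` module is used.

```python
import json

# The same configuration as above, as a Python dict for offline mode.
additional_config = {
    "weight_prefetch_config": {
        "enabled": True,
        "prefetch_ratio": {
            "attn": {"qkv": 1.0, "o": 1.0},
            "moe": {"gate_up": 0.8},
            "mlp": {"gate_up": 1.0, "down": 1.0},
        },
    },
    "finegrained_tp_config": {
        "lmhead_tensor_parallel_size": 8,
        "oproj_tensor_parallel_size": 8,
        "embedding_tensor_parallel_size": 8,
        "mlp_tensor_parallel_size": 8,
    },
    "enable_kv_nz": False,
    "multistream_overlap_shared_expert": True,
    "refresh": False,
}

# json.dumps turns True/False into true/false and drops trailing commas,
# producing a string suitable for --additional-config='...'.
flag_value = json.dumps(additional_config)

if __name__ == "__main__":
    print(f"--additional-config='{flag_value}'")
```

Keeping a single dict as the source of truth and serializing it for the CLI avoids the two forms drifting apart.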