Kimi-K2.6#
1 Introduction#
Kimi K2.6 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.
This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, accuracy and performance evaluation.
This document is validated and written based on vLLM-Ascend v0.20.0rc1. The current model (Kimi-K2.6) is first supported in this version.
2 Supported Features#
Refer to supported features to get the model’s supported feature matrix.
Refer to feature guide to get the feature’s configuration.
3 Prerequisites#
3.1 Model Weight#
Kimi-K2.6-w4a8(Quantized version for w4a8): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. Download model weight.kimi-k2.6-eagle3(Eagle3 MTP draft model for accelerating inference of Kimi-K2.6): Download model weightKimi-K2.5-DFlash(a speculative decoding framework that leverages a lightweight block diffusion model for parallel drafting): Download model weight
It is recommended to download the model weight to the shared directory of multiple nodes, such as /root/.cache/.
3.2 Verify Multi-node Communication (Optional)#
If you want to deploy multi-node environment, you need to verify multi-node communication according to verify multi-node communication environment.
4 Installation#
4.1 Docker Image Installation#
Select an image based on your machine type and start the docker image on your node, refer to using docker.
A3 series
Start the docker image on your each node.
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
A2 series
Start the docker image on your each node.
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
After a successful docker run, you can verify the running container service by executing the docker ps command.
4.2 Source Code Installation#
If you don’t want to use the docker image as above, you can also build all from source:
Install
vllm-ascendfrom source, refer to installation.
If you want to deploy multi-node environment, you need to set up environment on each node.
To use the tools_call feature, please ensure that your transformers version is 4.57.6 or lower. If vllm-ascend has been upgraded to v0.21 or later, this requirement no longer applies.
5 Online Service Deployment#
5.1 Single-Node Online Deployment#
Single-node deployment completes both Prefill and Decode within the same node. The quantized model Kimi-K2.6-w4a8 can be deployed on 1 Atlas 800 A3 (64G × 16).
While a single-node setup supports all input/output scenarios, consider deploying multinodes for optimal performance.
Startup Command:
#!/bin/sh
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_MLAPO=1
# [Optional] jemalloc
# jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on.
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export HCCL_BUFFSIZE=800
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1
vllm serve Eco-Tech/Kimi-K2.6-W4A8 \
--quantization ascend \
--served-model-name kimi_k26 \
--allowed-local-media-path / \
--trust-remote-code \
--tensor-parallel-size 4 \
--data-parallel-size 4 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--port 8088 \
--max-num-seqs 4 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.9 \
--seed 42 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--mm-processor-cache-gb 0 \
--mm-encoder-tp-mode data \
--speculative-config '{"method": "dflash","model": "z-lab/Kimi-K2.6-DFlash", "num_speculative_tokens": 15}'
Key Parameter Descriptions:
Setting the environment variable
VLLM_ASCEND_BALANCE_SCHEDULING=1enables balance scheduling. This may help increase output throughput and reduce TPOT in v1 scheduler. However, TTFT may degrade in some scenarios. Furthermore, enabling this feature is not recommended in scenarios where PD is separated.--max-model-lenspecifies the maximum context length - that is, the sum of input and output tokens for a single request. For performance testing with an input length of 3.5K and output length of 1.5K, a value of16384is sufficient, however, for precision testing, please set it at least35000.--no-enable-prefix-cachingindicates that prefix caching is disabled. To enable it, remove this option.--mm-encoder-tp-modeindicates how to optimize multi-modal encoder inference using tensor parallelism (TP). If you want to test the multimodal inputs, we recommend usingdata.If you use the w4a8 weight, more memory will be allocated to kvcache, and you can try to increase system throughput to achieve greater throughput.
Common Issues Tip: If you encounter issues, please refer to the Public FAQ for troubleshooting.
Service Verification:
curl http://<node0_ip>:8088/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "kimi_k26",
"messages": [{
"role": "user",
"content": [
{
"type": "text",
"text": "The future of AI is"
}]
}],
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.95
}'
Expected Result:
The service returns HTTP 200 OK with a JSON response containing the choices field. Example output (content truncated for brevity):
{
"id": "chatcmpl-9df13fd5e539af93",
"object": "chat.completion",
"created": 1780971952,
"model": "kimi_k26",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The future of AI is not a destination we are passively approaching, but a design problem we are actively solving right now...",
"reasoning": "The user is asking for my thoughts on \"The future of AI is\"...",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null
},
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"token_ids": null
}
],
"usage": {
"prompt_tokens": 13,
"total_tokens": 1037,
"completion_tokens": 1024,
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": null,
"accepted_prediction_tokens": null,
"rejected_prediction_tokens": null
}
}
}
5.2 Multi-Node PD Separation Deployment#
We recommend using Mooncake for deployment: Mooncake.
In the standard single-node deployment mode, Prefill (prompt processing) and Decode (token generation) tasks run on the same set of NPUs. This can lead to two issues:
Prefill preemption interrupts Decode: Prefill is a compute-intensive task that processes the entire input context at once, while Decode generates tokens one by one. When a new user request arrives, its Prefill phase can preempt and interrupt ongoing Decode tasks, causing jitter and higher time-per-output-token (TPOT) latency.
Inflexible resource allocation: Prefill and Decode have fundamentally different computational characteristics — Prefill is compute-bound and memory-bandwidth-intensive, while Decode is memory-bandwidth-bound. Running them on the same hardware forces a compromise that satisfies neither optimally.
PD (Prefill-Decode) separation addresses these issues by running Prefill and Decode on dedicated node groups, each configured independently:
Prefill nodes focus on high-throughput prompt processing, optimized for compute and communication (e.g., enabling FlashComm for Allreduce acceleration).
Decode nodes focus on low-latency token generation, optimized for memory bandwidth (e.g., enabling MLAPO fusion operators).
This architecture is recommended for production deployments with concurrent multi-user workloads, where stable latency and high throughput are both required.
Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is not enough NPU memory to serve high concurrency in 1P1D case.
Kimi-K2.6-w4a8 2P1D: requires 4 Atlas 800 A3 (64G × 16) nodes.
To run the vllm-ascend Prefill-Decode Disaggregation service, you need to deploy a launch_online_dp.py script and a run_dp_template.sh script on each node and deploy a proxy.sh script on prefill master node to forward requests.
launch_online_dp.pyto launch external dp vllm servers. launch_online_dp.pyParameter descriptions:
Parameter
Type
Required
Default
Description
--dp-sizeint
Yes
-
Data parallel size (total number of DP ranks across all nodes).
--tp-sizeint
No
1
Tensor parallel size within each DP rank.
--dp-size-localint
No
(same as
--dp-size)Number of DP ranks on the current node. If not set, defaults to
--dp-size.--dp-rank-startint
No
0
Starting rank offset for data parallel ranks on this node.
--dp-addressstr
Yes
-
IP address of the data parallel master node (node 0).
--dp-rpc-portstr
No
12345
RPC port for data parallel master communication.
--vllm-start-portint
No
9000
Starting port for each vLLM engine instance on this node. Each DP rank’s engine port =
vllm_start_port+ local rank index.Prefill Node 0
run_dp_template.shscript# this obtained through ifconfig # nic_name is the network interface name corresponding to local_ip of the current node nic_name="xxx" local_ip="141.xx.xx.1" # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) node0_ip="xxxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name # [Optional] jemalloc # jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 export VLLM_RPC_TIMEOUT=3600000 export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000 export HCCL_OP_EXPANSION_MODE="AIV" export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export TASK_QUEUE_ENABLE=1 export ASCEND_BUFFER_POOL=4:8 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH export HCCL_BUFFSIZE=800 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 export ASCEND_RT_VISIBLE_DEVICES=$1 vllm serve Eco-Tech/Kimi-K2.6-W4A8 \ --host 0.0.0.0 \ --port $2 \ --data-parallel-size $3 \ --data-parallel-rank $4 \ --data-parallel-address $5 \ --data-parallel-rpc-port $6 \ --tensor-parallel-size $7 \ --enable-expert-parallel \ --seed 1024 \ --quantization ascend \ --served-model-name kimi_k26 \ --trust-remote-code \ --max-num-seqs 4 \ --max-model-len 32768 \ --max-num-batched-tokens 16384 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 \ --enforce-eager \ --speculative-config '{"method": "eagle3", "model":"lightseekorg/kimi-k2.6-eagle3", "num_speculative_tokens": 3}' \ --additional-config '{"recompute_scheduler_enable":true}' \ --mm-encoder-tp-mode data \ --kv-transfer-config \ '{"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30000", "engine_id": "0", "kv_connector_extra_config": { "prefill": { "dp_size": 4, "tp_size": 4 }, "decode": { "dp_size": 8, "tp_size": 4 } } }'
Prefill Node 1
run_dp_template.shscript# this obtained through ifconfig # nic_name is the network interface name corresponding to local_ip of the current node nic_name="xxx" local_ip="141.xx.xx.2" # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) node0_ip="xxxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name # [Optional] jemalloc # jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 export VLLM_RPC_TIMEOUT=3600000 export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000 export HCCL_OP_EXPANSION_MODE="AIV" export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export TASK_QUEUE_ENABLE=1 export ASCEND_BUFFER_POOL=4:8 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH export HCCL_BUFFSIZE=800 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 export ASCEND_RT_VISIBLE_DEVICES=$1 vllm serve Eco-Tech/Kimi-K2.6-W4A8 \ --host 0.0.0.0 \ --port $2 \ --data-parallel-size $3 \ --data-parallel-rank $4 \ --data-parallel-address $5 \ --data-parallel-rpc-port $6 \ --tensor-parallel-size $7 \ --enable-expert-parallel \ --seed 1024 \ --quantization ascend \ --served-model-name kimi_k26 \ --trust-remote-code \ --max-num-seqs 4 \ --max-model-len 32768 \ --max-num-batched-tokens 16384 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.95 \ --enforce-eager \ --speculative-config '{"method": "eagle3", "model":"lightseekorg/kimi-k2.6-eagle3", "num_speculative_tokens": 3}' \ --additional-config '{"recompute_scheduler_enable":true}' \ --mm-encoder-tp-mode data \ --kv-transfer-config \ '{"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_producer", "kv_port": "30100", "engine_id": "1", "kv_connector_extra_config": { "prefill": { "dp_size": 4, "tp_size": 4 }, "decode": { "dp_size": 8, "tp_size": 4 } } }'
Decode Node 0
run_dp_template.shscript# this obtained through ifconfig # nic_name is the network interface name corresponding to local_ip of the current node nic_name="xxx" local_ip="141.xx.xx.3" # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) node0_ip="xxxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name # [Optional] jemalloc # jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 export VLLM_RPC_TIMEOUT=3600000 export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000 export HCCL_OP_EXPANSION_MODE="AIV" export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export TASK_QUEUE_ENABLE=1 export ASCEND_BUFFER_POOL=4:8 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH export HCCL_BUFFSIZE=800 export VLLM_ASCEND_ENABLE_MLAPO=1 export ASCEND_RT_VISIBLE_DEVICES=$1 vllm serve Eco-Tech/Kimi-K2.6-W4A8 \ --host 0.0.0.0 \ --port $2 \ --data-parallel-size $3 \ --data-parallel-rank $4 \ --data-parallel-address $5 \ --data-parallel-rpc-port $6 \ --tensor-parallel-size $7 \ --enable-expert-parallel \ --seed 1024 \ --quantization ascend \ --served-model-name kimi_k26 \ --trust-remote-code \ --max-num-seqs 8 \ --max-model-len 32768 \ --max-num-batched-tokens 32 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.91 \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": false}' \ --speculative-config '{"method": "eagle3", "model":"lightseekorg/kimi-k2.6-eagle3", "num_speculative_tokens": 3}' \ --kv-transfer-config \ '{"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_consumer", "kv_port": "30200", "engine_id": "2", "kv_connector_extra_config": { "prefill": { "dp_size": 4, "tp_size": 4 }, "decode": { "dp_size": 8, "tp_size": 4 } } }'
Decode Node 1
run_dp_template.shscript# this obtained through ifconfig # nic_name is the network interface name corresponding to local_ip of the current node nic_name="xxx" local_ip="141.xx.xx.4" # The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) node0_ip="xxxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name # [Optional] jemalloc # jemalloc is for better performance, if `libjemalloc.so` is installed on your machine, you can turn it on. export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 export VLLM_RPC_TIMEOUT=3600000 export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000 export HCCL_OP_EXPANSION_MODE="AIV" export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export TASK_QUEUE_ENABLE=1 export ASCEND_BUFFER_POOL=4:8 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH export HCCL_BUFFSIZE=1100 export VLLM_ASCEND_ENABLE_MLAPO=1 export ASCEND_RT_VISIBLE_DEVICES=$1 vllm serve Eco-Tech/Kimi-K2.6-W4A8 \ --host 0.0.0.0 \ --port $2 \ --data-parallel-size $3 \ --data-parallel-rank $4 \ --data-parallel-address $5 \ --data-parallel-rpc-port $6 \ --tensor-parallel-size $7 \ --enable-expert-parallel \ --seed 1024 \ --quantization ascend \ --served-model-name kimi_k26 \ --trust-remote-code \ --max-num-seqs 8 \ --max-model-len 32768 \ --max-num-batched-tokens 4 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.91 \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": false}' \ --speculative-config '{"method": "eagle3", "model":"lightseekorg/kimi-k2.6-eagle3", "num_speculative_tokens": 3}' \ --kv-transfer-config \ '{"kv_connector": "MooncakeConnectorV1", "kv_role": "kv_consumer", "kv_port": "30200", "engine_id": "2", "kv_connector_extra_config": { "prefill": { "dp_size": 4, "tp_size": 4 }, "decode": { "dp_size": 8, "tp_size": 4 } } }'
Key Parameter Descriptions:
VLLM_ASCEND_ENABLE_FLASHCOMM1=1: enables the communication optimization function on the prefill nodes.VLLM_ASCEND_ENABLE_MLAPO=1: enables the fusion operator, which can significantly improve performance but consumes more NPU memory. In the Prefill-Decode (PD) separation scenario, enable MLAPO only on decode nodes.recompute_scheduler_enable: true: enables the recomputation scheduler. When the Key-Value Cache (KV Cache) of the decode node is insufficient, requests will be sent to the prefill node to recompute the KV Cache. In the PD separation scenario, it is recommended to enable this configuration on both prefill and decode nodes simultaneously.multistream_overlap_shared_expert: true: When the Tensor Parallelism (TP) size is 1 orenable_shared_expert_dp: true, an additional stream is enabled to overlap the computation process of shared experts for improved efficiency.
Run server for each node:
# p0 python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.xx.xx.1 --dp-rpc-port 12321 --vllm-start-port 7100 # p1 python launch_online_dp.py --dp-size 4 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.xx.xx.2 --dp-rpc-port 12321 --vllm-start-port 7100 # d0 python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 8 --dp-rank-start 0 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100 # d1 python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 8 --dp-rank-start 8 --dp-address 141.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
Run the
proxy.shscript on the prefill master nodeRun a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository’s examples: load_balance_proxy_server_example.py
python load_balance_proxy_server_example.py \ --port 1999 \ --host 141.xx.xx.1 \ --prefiller-hosts \ 141.xx.xx.1 \ 141.xx.xx.1 \ 141.xx.xx.1 \ 141.xx.xx.1 \ 141.xx.xx.2 \ 141.xx.xx.2 \ 141.xx.xx.2 \ 141.xx.xx.2 \ --prefiller-ports \ 7100 7101 7102 7103 7100 7101 7102 7103 \ --decoder-hosts \ 141.xx.xx.3 \ 141.xx.xx.3 \ 141.xx.xx.3 \ 141.xx.xx.3 \ 141.xx.xx.4 \ 141.xx.xx.4 \ 141.xx.xx.4 \ 141.xx.xx.4 \ --decoder-ports \ 7100 7101 7102 7103 \ 7100 7101 7102 7103 \
cd vllm-ascend/examples/disaggregated_prefill_v1/ bash proxy.sh
Deployment Verification:
After the PD separation service is fully started, send a request through the proxy port on the prefill master node to verify that Prefill and Decode nodes are working correctly together:
curl http://141.xx.xx.1:1999/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "kimi_k26",
"messages": [{
"role": "user",
"content": [
{
"type": "text",
"text": "The future of AI is"
}]
}],
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.95
}'
Expected Result:
The proxy returns HTTP 200 OK. The JSON response contains the choices field with the generated text, confirming that Prefill nodes have successfully processed the prompt and Decode nodes have generated the response:
{
"id": "chatcmpl-xxxxxxxxxxxxx",
"object": "chat.completion",
"model": "kimi_k26",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The future of AI is not a destination we are passively approaching...",
"finish_reason": "length"
}
}
],
"usage": {
"prompt_tokens": 13,
"total_tokens": 1037,
"completion_tokens": 1024
}
}
Common Issues Tip: If you encounter issues with PD separation deployment, please refer to the Public FAQ for troubleshooting.
6 Functional Verification#
Once your server is started, you can query the model with input prompts:
curl http://<node0_ip>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "kimi_k26",
"messages": [{
"role": "user",
"content": [
{
"type": "text",
"text": "The future of AI is"
}]
}],
"max_tokens": 1024,
"temperature": 1.0,
"top_p": 0.95
}'
Expected Result:
The service returns HTTP 200 OK. The JSON response contains the choices field with the generated text, along with usage statistics:
{
"id": "chatcmpl-9df13fd5e539af93",
"object": "chat.completion",
"created": 1780971952,
"model": "kimi_k26",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The future of AI is not a destination we are passively approaching, but a design problem we are actively solving right now...",
"reasoning": "The user is asking for my thoughts on...",
"finish_reason": "length"
}
}
],
"usage": {
"prompt_tokens": 13,
"total_tokens": 1037,
"completion_tokens": 1024
}
}
7 Accuracy Evaluation#
Here is one accuracy evaluation method.
Using AISBench#
Refer to Using AISBench for details.
After execution, you can get the result. Here is the result of
Kimi-K2.6-w4a8invllm-ascend:v0.20.0rc1for reference only.
dataset |
version |
metric |
mode |
vllm-api-general-chat |
note |
|---|---|---|---|---|---|
AIME2026 |
- |
accuracy |
gen |
90.00 |
1 Atlas 800 A3 (64G × 16) |
GPQA |
- |
accuracy |
gen |
89.90 |
1 Atlas 800 A3 (64G × 16) |
MMMU |
- |
accuracy |
gen |
82.67 |
1 Atlas 800 A3 (64G × 16) |
8 Performance Evaluation#
Using AISBench#
Refer to Using AISBench for performance evaluation for details.
Using vLLM Benchmark#
Run performance evaluation of Kimi-K2.6-w4a8 as an example.
Refer to vllm benchmark for more details.
There are three vllm bench subcommands:
latency: Benchmark the latency of a single batch of requests.serve: Benchmark the online serving throughput.throughput: Benchmark offline inference throughput.
Take the serve as an example. Run the code as follows.
export VLLM_USE_MODELSCOPE=True
vllm bench serve --model Eco-Tech/Kimi-K2.6-w4a8 --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
After about several minutes, you can get the performance evaluation result.
9 Performance Tuning#
9.1 Recommended Configurations#
Note: The following configurations are validated in specific test environments and are for reference only. The optimal configuration depends on factors such as maximum input/output length, prefix cache hit rate, precision requirements, and deployment machine ratios. It is recommended to refer to Section 9.2 for tuning based on actual conditions.
Table 1: Scenario Overview#
*Total NPUsindicates the total number of NPUs used across all nodes. 1 node = 1 Atlas 800 A3 server (64G × 16 NPUs).
Scenario |
Deployment Mode |
*Total NPUs |
Weight Version |
Key Considerations |
|---|---|---|---|---|
High Throughput |
Single-Node Mixed |
16 (A3) |
kimi-k2.6-w4a8 |
Use dp2 tp8 to balance memory capacity and compute efficiency |
High Throughput |
1P1D deployment |
32 (A3) |
kimi-k2.6-w4a8 |
dp2 tp8 on both P and D nodes; balanced latency and throughput |
High Throughput |
2P2D deployment |
64 (A3) |
kimi-k2.6-w4a8 |
Scale from dp4 tp4 to dp8 tp4 across nodes |
Long Context |
Single-Node Mixed |
16 (A3) |
kimi-k2.6-w4a8 |
dp1 tp16 to maximize TP, accommodate extreme context lengths |
Long Context |
Single-Node Mixed |
16 (A3) |
kimi-k2.6-w4a8 |
dp2 tp8 to optimize memory bandwidth and improve cache utilization |
Multimodal |
Single-Node Mixed |
16 (A3) |
kimi-k2.6-w4a8 |
dp1 tp16 for high-resolution visual inputs |
Multimodal |
1P1D deployment |
32 (A3) |
kimi-k2.6-w4a8 |
dp2 tp8 or dp16 tp1, depending on memory and concurrency |
Multimodal |
2P2D deployment |
64 (A3) |
kimi-k2.6-w4a8 |
dp8 tp2 to dp32 tp1, maximize throughput for heavy multimodal workloads |
Table 2: Detailed Node Configuration#
Scenario |
Configuration |
NPUs |
TP |
DP |
Max Model Len |
MTP Speculation Num |
|---|---|---|---|---|---|---|
High Throughput / Low Latency (16K) |
Server / Single Machine |
16 |
8 |
2 |
~16K |
15 |
High Throughput / Low Latency (16K) |
Server-P Node |
16 |
8 |
2 |
~16K |
3 |
High Throughput / Low Latency (16K) |
Server-D Node |
16 |
8 |
2 |
~16K |
3 |
Long Context (128K, no cache) |
Server / Single Machine |
16 |
16 |
1 |
128K |
15 |
Long Context (128K, with cache) |
Server / Single Machine |
16 |
8 |
2 |
128K |
15 |
Multimodal (1080P) |
Server / Single Machine |
16 |
16 |
1 |
~16K |
15 |
Multimodal (1080P) |
Server-P Node |
16 |
8 |
2 |
~16K |
3 |
Multimodal (1080P) |
Server-D Node |
16 |
1 |
16 |
~16K |
3 |
For complete startup commands and parameter descriptions, please refer to the deployment examples in Chapter 5.
Notice:
max-model-len and max-num-seqs need to be set according to the actual usage scenario. For other settings, please refer to the Deployment chapter.
9.2 Tuning Guidelines#
9.2.1 General Tuning Reference#
Please refer to the Public Performance Tuning Documentation for tuning methods.
Please refer to the Feature Guide for detailed feature descriptions.
10 FAQ#
For common environment, installation, and general parameter issues, please refer to the Public FAQ; this chapter only covers model-specific issues.
Q: What transformer version is required for tools_call feature?
A: To use the tools_call feature, please ensure that your transformers version is 4.57.6 or lower. If vllm-ascend has been upgraded to v0.21 or later, this requirement no longer applies.