Kimi-K2-Thinking#
1 Introduction#
Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.
This document will demonstrate the main verification steps of the model, including supported features, environment preparation, installation, online service deployment, functional verification, accuracy evaluation, performance evaluation, performance tuning, and FAQ.
We recommend following this document with the latest release candidate or official release.
2 Supported Features#
Refer to supported features to get the model’s supported feature matrix.
Refer to feature guide to get the feature’s configuration.
3 Prerequisites#
3.1 Model Weight#
Kimi-K2-Thinking(bfloat16): requires 1 Atlas 800 A3 (64G x 16) node. Download model weight.
It is recommended to download the model weight to the shared directory, such as /mnt/sfs_turbo/.cache/.
After downloading the model weights, edit config.json in the original model directory and change the value of "quantization_config.config_groups.group_0.targets" from ["Linear"] to ["MoE"] so that the quantized model can be verified. Only the relevant fields are shown:
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
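The manual edit above can also be scripted. The following is a minimal sketch, assuming config.json already contains the quantization_config.config_groups.group_0 entry; the helper names set_moe_targets and patch_config_file are illustrative and not part of vLLM Ascend.

```python
import json
from pathlib import Path


def set_moe_targets(config: dict) -> dict:
    """Return a copy of config with group_0 targets changed to ["MoE"]."""
    patched = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    # A KeyError here means the expected quantization entry is absent.
    group = patched["quantization_config"]["config_groups"]["group_0"]
    group["targets"] = ["MoE"]
    return patched


def patch_config_file(path: Path) -> None:
    """Rewrite config.json in place, keeping a .bak copy of the original."""
    original_text = path.read_text(encoding="utf-8")
    patched = set_moe_targets(json.loads(original_text))
    path.with_name(path.name + ".bak").write_text(original_text, encoding="utf-8")
    path.write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")
```

To apply it, call patch_config_file(Path(...)) with the path to config.json inside your model directory. All other fields in the file are left unchanged.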
Your model files should look like:
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
|-- tokenizer_config.json
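A quick way to confirm that the download is complete is to check that no weight shard is missing. This is a minimal sketch, assuming the shards follow the model-NNNNN-of-NNNNNN.safetensors naming shown in the listing above; the helper name missing_shards is illustrative.

```python
import os
import re

SHARD_PATTERN = re.compile(r"^model-(\d+)-of-(\d+)\.safetensors$")


def missing_shards(filenames):
    """Return the sorted shard indices that are absent from filenames.

    The total shard count is read from the "-of-NNNNNN" part of the names,
    so the expected count is not hard-coded.
    """
    present = set()
    total = 0
    for name in filenames:
        match = SHARD_PATTERN.match(name)
        if match:
            present.add(int(match.group(1)))
            total = max(total, int(match.group(2)))
    if total == 0:
        raise ValueError("no model-*-of-*.safetensors files found")
    return [index for index in range(1, total + 1) if index not in present]
```

For example, missing_shards(os.listdir(model_dir)) returns an empty list when every shard is present, where model_dir is the directory that holds the files listed above.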
4 Installation#
4.1 Docker Image Installation#
You can use the official Docker image to run Kimi-K2-Thinking directly.
Select an image that matches your machine type and start the container on your node. For details, refer to using docker.
# Update the vllm-ascend image according to your environment.
export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-a3
# Container name; the status check after startup filters on this name.
export NAME=vllm-ascend
# Run the container using the defined variables.
# Note: if you run Docker with a bridge network, expose the ports needed for multi-node communication in advance.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci8 \
    --device /dev/davinci9 \
    --device /dev/davinci10 \
    --device /dev/davinci11 \
    --device /dev/davinci12 \
    --device /dev/davinci13 \
    --device /dev/davinci14 \
    --device /dev/davinci15 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/sfs_turbo/.cache:/home/cache \
    -it $IMAGE bash
Parameter Descriptions:
IMAGE: specifies the vllm-ascend image. The -a3 suffix selects the Atlas A3 image.
NAME: specifies the container name.
--net=host: uses host networking, so the vLLM service port is exposed on the host directly.
--shm-size=1g: configures container shared memory.
--device /dev/davinci[0-15]: exposes 16 Ascend NPU devices to the container.
--device /dev/davinci_manager, --device /dev/devmm_svm, and --device /dev/hisi_hdc: expose required Ascend runtime device files.
-v /usr/local/dcmi:/usr/local/dcmi: mounts DCMI tools for device management.
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi: mounts the NPU monitoring command.
-v /usr/local/Ascend/driver/*: mounts Ascend driver libraries and version files.
-v /etc/ascend_install.info:/etc/ascend_install.info: mounts Ascend installation metadata.
-v /mnt/sfs_turbo/.cache:/home/cache: mounts the shared model cache directory. Update it if you store model weights elsewhere.
After the container starts, run the following command on the host to verify the container status:
docker ps --filter name=vllm-ascend --format "table {{.Names}}\t{{.Status}}"
Expected Status:
The container name is vllm-ascend.
The status is Up ....
The container does not exit immediately.
Run the following command in the container to verify that Ascend devices are visible:
npu-smi info
Expected Status:
The command exits successfully.
The output lists the expected NPU devices.
Device health status is normal.
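As a complement to npu-smi info, you can check from inside the container that the device files passed with --device are present. This is a minimal sketch that only tests whether the paths exist, assuming the 16-NPU layout from the docker run command above; the helper names are illustrative, and the check says nothing about device health.

```python
import os


def expected_device_nodes(npu_count=16):
    """List the device files the docker run command above exposes."""
    nodes = [f"/dev/davinci{index}" for index in range(npu_count)]
    nodes += ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
    return nodes


def missing_device_nodes(npu_count=16, exists=os.path.exists):
    """Return the expected device files that are not visible.

    The exists parameter is injectable so the check can be tested
    on a machine without Ascend hardware.
    """
    return [path for path in expected_device_nodes(npu_count) if not exists(path)]
```

An empty list from missing_device_nodes() means every expected device file is visible in the container.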
4.2 Source Code Installation#
If you do not want to use the Docker image, you can also build from source:
# Install vLLM.
git clone --depth 1 --branch v0.22.1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -e .
cd ..
# Install vLLM Ascend.
git clone --depth 1 --branch v0.22.1rc1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
To verify the source installation, run:
python -c "import vllm; import vllm_ascend; print('vllm and vllm_ascend import ok')"
Expected Status:
The command exits successfully.
vllm and vllm_ascend import ok is printed.
If you want to deploy a multi-node environment, set up the same software environment on each node.
5 Online Service Deployment#
5.1 Single-Node Online Deployment#
Single-node deployment completes both Prefill and Decode within the same node, suitable for online inference scenarios with moderate concurrency requirements.
For an Atlas 800 A3 (64G x 16) node, tensor-parallel-size should be at least 16.
Run the following script to start the vLLM server:
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SERVER_PORT=8000
vllm serve moonshotai/Kimi-K2-Thinking \
--tensor-parallel-size 16 \
--port $SERVER_PORT \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 12 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-expert-parallel \
--no-enable-prefix-caching
Parameter and Environment Variable Descriptions:
HCCL_BUFFSIZE=1024: configures the HCCL buffer size.
TASK_QUEUE_ENABLE=1: enables task queue scheduling.
OMP_PROC_BIND=false: avoids overly strict OpenMP CPU binding.
HCCL_OP_EXPANSION_MODE=AIV: enables the AIV communication path.
PYTORCH_NPU_ALLOC_CONF=expandable_segments:True: reduces NPU memory fragmentation.
SERVER_PORT: sets the service port. The generated script maps DEFAULT_PORT to 8000.
--tensor-parallel-size 16: uses 16-way tensor parallelism on the A3 node.
--max-model-len 8192: sets the maximum model context length.
--max-num-batched-tokens 8192: sets the maximum number of batched tokens.
--max-num-seqs 12: sets the maximum number of concurrent sequences.
--gpu-memory-utilization 0.9: controls the memory ratio used by vLLM.
--trust-remote-code: allows loading model-specific remote code.
--enable-expert-parallel: enables expert parallelism for MoE layers.
--no-enable-prefix-caching: disables prefix caching for a stable baseline.
Service Verification:
After the service starts, you should see logs similar to:
INFO: Started server process [...]
INFO: Waiting for application startup.
INFO: Application startup complete.
Expected Status:
The server process starts successfully.
No error logs related to HCCL or NPU initialization.
The container does not exit immediately.
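Loading the weights can take a while, so scripts that depend on the service usually poll until it is ready instead of watching the logs. This is a minimal sketch, assuming the vLLM OpenAI-compatible server answers GET /health with status 200 once it is up; the helper names, attempt count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure.
        return None


def wait_until_ready(url, attempts=60, delay_s=10.0, probe=http_status, sleep=time.sleep):
    """Poll url until it answers 200; return True on success, False on give-up."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

For the server started above, wait_until_ready("http://localhost:8000/health") returns True once the service accepts requests. Increase attempts if model loading takes longer in your environment.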
6 Functional Verification#
After the service is started, the model can be invoked by sending a prompt:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "moonshotai/Kimi-K2-Thinking",
"messages": [
{"role": "user", "content": "Who are you?"}
],
"temperature": 1.0
}'
Expected Result:
The HTTP status code is 200.
choices[0].message.content contains the generated assistant response.
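The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl command above, assuming the server listens on localhost:8000; the function names and the timeout value are illustrative.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="moonshotai/Kimi-K2-Thinking", temperature=1.0):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def extract_content(response_body):
    """Return choices[0].message.content from a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8000", timeout=600.0):
    """Send one chat request and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises HTTPError for non-2xx responses, so reaching the
    # next line implies a successful status code.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_content(json.load(response))
```

Calling chat("Who are you?") against a running server returns the generated reply as a string.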
7 Accuracy Evaluation#
Using AISBench#
For details, please refer to Using AISBench.
Using lm-eval#
You can use lm-eval to evaluate the model accuracy through the OpenAI-compatible API.
For lm_eval installation, please refer to Using lm_eval.
Run lm_eval to execute the accuracy evaluation:
lm_eval \
--model local-completions \
--model_args model=moonshotai/Kimi-K2-Thinking,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
Reference configuration: gsm8k (5-shot), --apply_chat_template, --fewshot_as_multiturn, greedy decoding (temperature=0.0, top_p=1.0), max 2048 output tokens, batch size 1.
Below are reference gsm8k results for Kimi-K2-Thinking powered by vllm-ascend:v0.20.2rc1, evaluated on one Atlas 800 A3 node (64G × 16).
| task | version | filter | n-shot | metric | value | stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | | 5 | | 0.8992 | 0.0083 |
| gsm8k | 3 | | 5 | | 0.8453 | 0.0100 |
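When you compare your own run with the reference values above, small differences are expected because of sampling error. This is a minimal sketch of one possible check; the two-standard-error tolerance is an assumption chosen for illustration, not a threshold defined by lm-eval or vLLM Ascend.

```python
def within_reference(measured, reference, stderr, num_stderr=2.0):
    """Return True if measured lies within num_stderr * stderr of reference."""
    return abs(measured - reference) <= num_stderr * stderr


# Reference value and stderr taken from the first row of the table above.
REFERENCE_VALUE = 0.8992
REFERENCE_STDERR = 0.0083
```

For example, within_reference(0.89, REFERENCE_VALUE, REFERENCE_STDERR) passes, while a measured value of 0.85 falls outside the tolerance and is worth investigating.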
8 Performance Evaluation#
Refer to vllm benchmark for more details.
Test Command Example:
vllm bench serve \
--backend openai-chat \
--model moonshotai/Kimi-K2-Thinking \
--endpoint /v1/chat/completions \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 10 \
--request-rate 1
After the benchmark completes, you can get the performance result, including request throughput, output token throughput, TTFT, TPOT, and ITL.
The following reference results are obtained with vllm-ascend:v0.20.2rc1 on one Atlas 800 A3 node (64G × 16), using OpenAI chat serving, random input/output lengths, 10 prompts, and --request-rate 1:
| random input len | random output len | success | duration (s) | request throughput (req/s) | output throughput (tok/s) | total throughput (tok/s) | mean TTFT (ms) | mean TPOT (ms) | mean ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 512 | 10 / 10 | 111.00 | 0.09 | 46.12 | 94.38 | 507.60 | 200.47 | 200.08 |
| 1024 | 1024 | 10 / 10 | 221.52 | 0.05 | 46.23 | 93.48 | 566.39 | 208.20 | 208.00 |
| 2048 | 2048 | 10 / 10 | 479.72 | 0.02 | 42.69 | 85.78 | 722.32 | 230.26 | 230.15 |
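The throughput columns follow from the prompt count, output length, and duration, which is useful for sanity-checking your own results. This is a minimal sketch, assuming every request generates exactly the requested number of output tokens; the function names are illustrative.

```python
def request_throughput(num_prompts, duration_s):
    """Completed requests per second over the whole benchmark run."""
    return num_prompts / duration_s


def output_throughput(num_prompts, output_len, duration_s):
    """Generated tokens per second, assuming output_len tokens per request."""
    return num_prompts * output_len / duration_s


# Rows from the reference table above: (output length, duration in seconds).
for output_len, duration_s in [(512, 111.00), (1024, 221.52), (2048, 479.72)]:
    print(
        output_len,
        round(request_throughput(10, duration_s), 2),
        round(output_throughput(10, output_len, duration_s), 2),
    )
```

The computed values agree with the table to within rounding of the reported duration.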
For a concurrency sweep, keep the input and output length fixed and vary --max-concurrency:
MODEL_NAME=moonshotai/Kimi-K2-Thinking
INPUT_LEN=1024
OUTPUT_LEN=1024
for CONCURRENCY in 1 2 4 8 16 32; do
NUM_PROMPTS=$((CONCURRENCY * 10))
vllm bench serve \
--backend openai-chat \
--model "$MODEL_NAME" \
--endpoint /v1/chat/completions \
--dataset-name random \
--random-input-len "$INPUT_LEN" \
--random-output-len "$OUTPUT_LEN" \
--num-prompts "$NUM_PROMPTS" \
--request-rate inf \
--max-concurrency "$CONCURRENCY"
done
Reference results for 1024 input tokens and 1024 output tokens are:
| max concurrency | prompts | success | duration (s) | request throughput (req/s) | output throughput (tok/s) | total throughput (tok/s) | mean TTFT (ms) | P99 TTFT (ms) | mean TPOT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 10 / 10 | 595.07 | 0.02 | 17.21 | 34.80 | 473.71 | 712.49 | 57.71 |
| 2 | 20 | 20 / 20 | 623.88 | 0.03 | 32.83 | 66.35 | 708.16 | 996.59 | 60.29 |
| 4 | 40 | 40 / 40 | 725.38 | 0.06 | 56.47 | 114.13 | 956.11 | 1137.55 | 69.97 |
| 8 | 80 | 80 / 80 | 907.44 | 0.09 | 90.28 | 182.43 | 1361.85 | 1900.15 | 87.37 |
| 16 | 160 | 160 / 160 | 3093.07 | 0.05 | 52.97 | 107.04 | 76766.84 | 251245.22 | 222.07 |
Note: At a concurrency of 16, mean TTFT rises sharply (76.7 s), which indicates severe queueing delay. If the server was started with --max-num-seqs 12 as in Section 5.1, requests beyond that limit wait in the queue before prefill begins. For production deployments, limit concurrency according to your latency requirements, or increase --max-num-seqs and --max-num-batched-tokens if NPU memory allows.
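One way to turn the sweep into a concurrency limit is to pick the highest-throughput setting whose tail TTFT stays within a latency budget. This is a minimal sketch over the reference rows above; the SweepRow type, the function name, and the budget values are illustrative, and your own measurements should replace the reference data.

```python
from typing import NamedTuple


class SweepRow(NamedTuple):
    max_concurrency: int
    output_tok_s: float
    mean_ttft_ms: float
    p99_ttft_ms: float


# Values copied from the reference concurrency sweep (1024 in / 1024 out).
REFERENCE_SWEEP = [
    SweepRow(1, 17.21, 473.71, 712.49),
    SweepRow(2, 32.83, 708.16, 996.59),
    SweepRow(4, 56.47, 956.11, 1137.55),
    SweepRow(8, 90.28, 1361.85, 1900.15),
    SweepRow(16, 52.97, 76766.84, 251245.22),
]


def best_concurrency(rows, p99_ttft_budget_ms):
    """Return the row with the highest output throughput within the P99 TTFT budget.

    Returns None when no row meets the budget.
    """
    eligible = [row for row in rows if row.p99_ttft_ms <= p99_ttft_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row.output_tok_s)
```

With the reference data, a 2000 ms P99 TTFT budget selects concurrency 8, and a 1200 ms budget selects concurrency 4.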
9 Performance Tuning#
9.1 Recommended Configurations#
Note: The following configurations are validated in specific test environments and are for reference only. The optimal configuration depends on factors such as maximum input/output length, prefix cache hit rate, precision requirements, and deployment machine ratios. It is recommended to refer to Section 9.2 for tuning based on actual conditions.
Table 1: Scenario Overview#
| Scenario | Deployment Mode | Total NPUs | Weight Version | Key Considerations |
|---|---|---|---|---|
| Long Context | Single-node | 16 (A3) | bfloat16 | Keep |
| Low Latency | Single-node | 16 (A3) | bfloat16 | Reduce |
| High Throughput | Single-node | 16 (A3) | bfloat16 | Increase |
Table 2: Detailed Recommendations#
Long context: use tp16, keep --max-model-len close to the real maximum input and output length, and reduce --max-num-seqs first when memory pressure is high.
Low latency: reduce --max-num-seqs and --max-num-batched-tokens to reduce queueing delay.
High throughput: increase --max-num-seqs gradually and benchmark with a request rate close to the real workload. For long-context throughput tests, evaluate --decode-context-parallel-size as an optional tuning knob.
For 1024 input tokens and 1024 output tokens in the reference concurrency sweep, --max-concurrency 8 had the best output throughput. Higher concurrency increased TTFT significantly, so validate tail latency before using it in production.
Note:
--max-model-len and --max-num-seqs need to be set according to the actual usage scenario.
If the service runs under high concurrency, verify NPU health and HCCL status before increasing the request rate.
9.2 Tuning Guidelines#
9.2.1 General Tuning Reference#
Please refer to the Public Performance Tuning Documentation for general tuning methods.
Please refer to the Feature Guide for detailed feature descriptions.
10 FAQ#
For common environment, installation, and general parameter issues, please refer to the Public FAQ; this chapter only covers model-specific issues.
Q: API returns {"error":"Model not found"} or 404 when requesting with model: "Kimi-K2-Thinking"?
A: The server registers the model under its full path moonshotai/Kimi-K2-Thinking by default. When the request uses the short name Kimi-K2-Thinking without a --served-model-name override, the server cannot resolve the model ID. Use "model": "moonshotai/Kimi-K2-Thinking" in requests, or start the server with --served-model-name Kimi-K2-Thinking to enable the short name.
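To see which model IDs the server actually accepts, you can query the model list before sending requests. This is a minimal sketch, assuming the server exposes the OpenAI-compatible GET /v1/models endpoint on localhost:8000; the function names are illustrative.

```python
import json
import urllib.request


def served_model_ids(models_response):
    """Extract model IDs from a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(base_url="http://localhost:8000", timeout=10.0):
    """Query the running server and return the model IDs it registered."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
        return served_model_ids(json.load(response))
```

If the name you plan to send in the "model" field is not in the list returned by fetch_served_model_ids(), use one of the listed IDs or restart the server with --served-model-name.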