Kimi-K2-Thinking¶

1 Introduction¶

Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.

This document walks through the main steps for verifying the model: supported features, environment preparation, installation, online service deployment, functional verification, accuracy evaluation, performance evaluation, performance tuning, and FAQ.

We recommend following this document with the latest release candidate or official release.

2 Supported Features¶

Refer to supported features to get the model's supported feature matrix.

Refer to feature guide to get the feature's configuration.

3 Prerequisites¶

3.1 Model Weight¶

Kimi-K2-Thinking (bfloat16): requires 1 Atlas 800 A3 (64G x 16) node. Download model weight.

It is recommended to download the model weight to the shared directory, such as /mnt/sfs_turbo/.cache/.

After downloading the model weights, edit config.json of the original model and change the value of "quantization_config.config_groups.group_0.targets" from ["Linear"] to ["MoE"] so that the quantized model can be verified:

{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
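The edit above can also be scripted. The following is a minimal sketch, assuming config.json has the nested layout shown above; the helper names set_quant_targets and patch_config_file are illustrative and are not part of vLLM or vllm-ascend.

```python
import json
from pathlib import Path


def set_quant_targets(config: dict, new_targets=("MoE",)) -> dict:
    """Return a copy of config with group_0 targets replaced by new_targets."""
    patched = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    group = patched["quantization_config"]["config_groups"]["group_0"]
    group["targets"] = list(new_targets)
    return patched


def patch_config_file(path: str) -> None:
    """Rewrite config.json in place, keeping a .bak copy of the original."""
    cfg_path = Path(path)
    original_text = cfg_path.read_text(encoding="utf-8")
    backup = cfg_path.with_name(cfg_path.name + ".bak")
    backup.write_text(original_text, encoding="utf-8")
    patched = set_quant_targets(json.loads(original_text))
    cfg_path.write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")


# Example call (the path is hypothetical; point it at your own model directory):
# patch_config_file("/mnt/sfs_turbo/.cache/Kimi-K2-Thinking/config.json")
```

Keeping a backup makes it easy to restore the original targets value if needed.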

Your model files should look like:

.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
|-- tokenizer_config.json

4 Installation¶

4.1 Docker Image Installation¶

You can use the official Docker image to run Kimi-K2-Thinking directly.

Select an image based on your machine type and start the Docker image on your node. For details, refer to using docker.

# Update the vllm-ascend image according to your environment.
export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-a3
# Container name; the verification command below filters on this name.
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash

Parameter Descriptions:

IMAGE: specifies the vllm-ascend image. The -a3 suffix selects the Atlas A3 image.
NAME: specifies the container name.
--net=host: uses host networking, so the vLLM service port is exposed on the host directly.
--shm-size=1g: configures container shared memory.
--device /dev/davinci[0-15]: exposes 16 Ascend NPU devices to the container.
--device /dev/davinci_manager, --device /dev/devmm_svm, and --device /dev/hisi_hdc: expose required Ascend runtime device files.
-v /usr/local/dcmi:/usr/local/dcmi: mounts DCMI tools for device management.
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi: mounts the NPU monitoring command.
-v /usr/local/Ascend/driver/*: mounts Ascend driver libraries and version files.
-v /etc/ascend_install.info:/etc/ascend_install.info: mounts Ascend installation metadata.
-v /mnt/sfs_turbo/.cache:/home/cache: mounts the shared model cache directory. Update it if you store model weights elsewhere.
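The 16 repeated --device flags can be generated rather than typed by hand. This is a small sketch, assuming the device nodes follow the /dev/davinciN naming used in the command above; davinci_device_flags is a hypothetical helper, not a Docker or Ascend tool.

```python
import shlex


def davinci_device_flags(num_npus: int = 16) -> list:
    """Build the docker --device arguments for num_npus NPUs plus runtime devices."""
    flags = []
    for idx in range(num_npus):
        flags += ["--device", f"/dev/davinci{idx}"]
    # Runtime device files listed in the docker run command above.
    for extra in ("davinci_manager", "devmm_svm", "hisi_hdc"):
        flags += ["--device", f"/dev/{extra}"]
    return flags


def as_shell_fragment(flags: list) -> str:
    """Join the arguments into a string that can be pasted into a shell command."""
    return shlex.join(flags)
```

For example, as_shell_fragment(davinci_device_flags(16)) produces the device portion of the docker run command for one 16-NPU node.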

After the container starts, run the following command from another terminal on the host to verify the container status:

docker ps --filter name=vllm-ascend --format "table {{.Names}}\t{{.Status}}"

Expected Status:

The container name is vllm-ascend.
The status is Up ....
The container does not exit immediately.

Run the following command in the container to verify that Ascend devices are visible:

npu-smi info

Expected Status:

The command exits successfully.
The output lists the expected NPU devices.
Device health status is normal.

4.2 Source Code Installation¶

If you do not want to use the Docker image, you can also build from source:

# Install vLLM.
git clone --depth 1 --branch v0.22.1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -e .
cd ..

# Install vLLM Ascend.
git clone --depth 1 --branch v0.22.1rc1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .

To verify the source installation, run:

python -c "import vllm; import vllm_ascend; print('vllm and vllm_ascend import ok')"

Expected Status:

The command exits successfully.
vllm and vllm_ascend import ok is printed.
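Beyond a successful import, it can help to record which versions are installed. The sketch below reads package metadata with the standard library. The distribution names "vllm" and "vllm-ascend" are assumptions; adjust them if your environment registers the packages under different names.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_report(dist_names=("vllm", "vllm-ascend")) -> dict:
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in dist_names}


# Example call:
# print(version_report())
```

A None value means the metadata lookup failed for that name, which is worth checking before debugging anything else.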

If you want to deploy a multi-node environment, set up the same software environment on each node.

5 Online Service Deployment¶

5.1 Single-Node Online Deployment¶

Single-node deployment runs both prefill and decode on the same node and is suitable for online inference scenarios with moderate concurrency requirements.

For an Atlas 800 A3 (64G x 16) node, tensor-parallel-size should be at least 16.

Run the following script to start the vLLM server:

export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SERVER_PORT=8000

vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 16 \
  --port $SERVER_PORT \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 12 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enable-expert-parallel \
  --no-enable-prefix-caching

Parameter and Environment Variable Descriptions:

HCCL_BUFFSIZE=1024: configures the HCCL buffer size.
TASK_QUEUE_ENABLE=1: enables task queue scheduling.
OMP_PROC_BIND=false: avoids overly strict OpenMP CPU binding.
HCCL_OP_EXPANSION_MODE=AIV: enables the AIV communication path.
PYTORCH_NPU_ALLOC_CONF=expandable_segments:True: reduces NPU memory fragmentation.
SERVER_PORT: sets the service port (8000 in this example).
--tensor-parallel-size 16: uses 16-way tensor parallelism on the A3 node.
--max-model-len 8192: sets the maximum model context length.
--max-num-batched-tokens 8192: sets the maximum number of batched tokens.
--max-num-seqs 12: sets the maximum number of concurrent sequences.
--gpu-memory-utilization 0.9: sets the fraction of device memory that vLLM may use (0.9 means 90%).
--trust-remote-code: allows loading model-specific remote code.
--enable-expert-parallel: enables expert parallelism for MoE layers.
--no-enable-prefix-caching: disables prefix caching for a stable baseline.

Service Verification:

After the service starts, you should see logs similar to:

INFO:     Started server process [...]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Expected Status:

The server process starts successfully.
No error logs related to HCCL or NPU initialization appear.
The server process keeps running and does not exit immediately.
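Loading a model of this size can take a while, so scripts usually wait for the server before sending requests. The sketch below polls the server's /health endpoint until it returns HTTP 200. The timeout and interval values are arbitrary assumptions, and the helper names are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout_s=1800.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example call, assuming the server listens on port 8000 as configured above:
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```

The probe is passed in as a callable so that the waiting logic can be tested without a running server.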

6 Functional Verification¶

After the service is started, the model can be invoked by sending a prompt:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "moonshotai/Kimi-K2-Thinking",
  "messages": [
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 1.0
}'

Expected Result:

The HTTP status code is 200.
choices[0].message.content contains the generated assistant response.
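The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl command above and reads choices[0].message.content from the response; the function names and the timeout value are illustrative assumptions.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "moonshotai/Kimi-K2-Thinking",
                       temperature: float = 1.0) -> dict:
    """Build the JSON body used by the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def extract_content(response: dict) -> str:
    """Return the assistant text from a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = "http://localhost:8000/v1",
         timeout_s: float = 600.0) -> str:
    """POST one chat request and return the generated text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return extract_content(json.load(resp))


# Example call (requires the running server from Section 5.1):
# print(chat("Who are you?"))
```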

7 Accuracy Evaluation¶

Using AISBench¶

For details, please refer to Using AISBench.

Using lm-eval¶

You can use lm-eval to evaluate the model accuracy through the OpenAI-compatible API.

For lm_eval installation, please refer to Using lm_eval.

Run lm_eval to execute the accuracy evaluation:

lm_eval \
  --model local-completions \
  --model_args model=moonshotai/Kimi-K2-Thinking,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
  --tasks gsm8k \
  --output_path ./

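Reference configuration for the results below: gsm8k (5-shot) with --apply_chat_template and --fewshot_as_multiturn, greedy decoding (temperature=0.0, top_p=1.0), at most 2048 output tokens, and batch size 1. The command above does not set these options explicitly; add them to reproduce the reference configuration.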

Below are reference gsm8k results for Kimi-K2-Thinking powered by vllm-ascend:v0.20.2rc1, evaluated on one Atlas 800 A3 node (64G × 16).

task	version	filter	n-shot	metric	value	stderr
gsm8k	3	flexible-extract	5	exact_match	0.8992	0.0083
gsm8k	3	strict-match	5	exact_match	0.8453	0.0100
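When comparing your own run against these numbers, the stderr column gives a rough sense of how much variation to expect. The sketch below computes an approximate 95% interval as value ± 1.96 × stderr, which assumes a normal approximation; this is a common convention and not something the evaluation tool reports itself.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple:
    """Approximate confidence interval using a normal approximation."""
    half_width = z * stderr
    return (value - half_width, value + half_width)


# Reference values from the table above.
for name, value, stderr in [
    ("flexible-extract", 0.8992, 0.0083),
    ("strict-match", 0.8453, 0.0100),
]:
    low, high = confidence_interval(value, stderr)
    print(f"{name}: {low:.4f} to {high:.4f}")
```

A result from your own environment that falls well outside the corresponding interval is worth investigating before drawing conclusions about accuracy.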

8 Performance Evaluation¶

Refer to vllm benchmark for more details.

Test Command Example:

vllm bench serve \
  --backend openai-chat \
  --model moonshotai/Kimi-K2-Thinking \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --request-rate 1

After the benchmark completes, it reports the performance results, including request throughput, output token throughput, TTFT, TPOT, and ITL.

The following reference results are obtained with vllm-ascend:v0.20.2rc1 on one Atlas 800 A3 node (64G × 16), using OpenAI chat serving, random input/output lengths, 10 prompts, and --request-rate 1:

random input len	random output len	success	duration (s)	request throughput (req/s)	output throughput (tok/s)	total throughput (tok/s)	mean TTFT (ms)	mean TPOT (ms)	mean ITL (ms)
512	512	10 / 10	111.00	0.09	46.12	94.38	507.60	200.47	200.08
1024	1024	10 / 10	221.52	0.05	46.23	93.48	566.39	208.20	208.00
2048	2048	10 / 10	479.72	0.02	42.69	85.78	722.32	230.26	230.15
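The output throughput column can be cross-checked against the duration column. The sketch below assumes every request generates exactly the configured number of output tokens, so that output throughput is approximately prompts × output length / duration; small differences come from rounding in the reported values.

```python
def output_throughput(num_prompts: int, output_len: int, duration_s: float) -> float:
    """Output tokens per second if every request emits output_len tokens."""
    return num_prompts * output_len / duration_s


# (output length, duration in seconds, reported output throughput in tok/s)
REFERENCE_ROWS = [
    (512, 111.00, 46.12),
    (1024, 221.52, 46.23),
    (2048, 479.72, 42.69),
]

for output_len, duration_s, reported in REFERENCE_ROWS:
    derived = output_throughput(10, output_len, duration_s)
    print(f"output_len={output_len}: derived {derived:.2f}, reported {reported:.2f}")
```

If a derived value differs noticeably from the reported one in your own run, some requests probably stopped early or failed.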

For a concurrency sweep, keep the input and output length fixed and vary --max-concurrency:

MODEL_NAME=moonshotai/Kimi-K2-Thinking
INPUT_LEN=1024
OUTPUT_LEN=1024

for CONCURRENCY in 1 2 4 8 16 32; do
  NUM_PROMPTS=$((CONCURRENCY * 10))
  vllm bench serve \
    --backend openai-chat \
    --model "$MODEL_NAME" \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --random-input-len "$INPUT_LEN" \
    --random-output-len "$OUTPUT_LEN" \
    --num-prompts "$NUM_PROMPTS" \
    --request-rate inf \
    --max-concurrency "$CONCURRENCY"
done

Reference results for 1024 input tokens and 1024 output tokens are:

max concurrency	prompts	success	duration (s)	request throughput (req/s)	output throughput (tok/s)	total throughput (tok/s)	mean TTFT (ms)	P99 TTFT (ms)	mean TPOT (ms)
1	10	10 / 10	595.07	0.02	17.21	34.80	473.71	712.49	57.71
2	20	20 / 20	623.88	0.03	32.83	66.35	708.16	996.59	60.29
4	40	40 / 40	725.38	0.06	56.47	114.13	956.11	1137.55	69.97
8	80	80 / 80	907.44	0.09	90.28	182.43	1361.85	1900.15	87.37
16	160	160 / 160	3093.07	0.05	52.97	107.04	76766.84	251245.22	222.07

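Note: At a concurrency of 16, mean TTFT rises sharply to about 76.8 s and output throughput drops from 90.28 to 52.97 tok/s, which indicates severe queueing delay. A likely contributor is that 16 concurrent requests exceed the --max-num-seqs 12 limit set in the launch command in Section 5.1, so the excess requests wait in the queue. For production deployment, limit concurrency based on your latency requirements, or increase --max-num-seqs and --max-num-batched-tokens if NPU memory allows.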
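One way to turn the sweep into a concurrency limit is to pick the setting with the highest output throughput among those that meet a TTFT budget. The sketch below applies this rule to the reference table; the 2000 ms P99 TTFT budget in the example is a hypothetical value, not a recommendation.

```python
# (max concurrency, output throughput tok/s, mean TTFT ms, P99 TTFT ms)
SWEEP = [
    (1, 17.21, 473.71, 712.49),
    (2, 32.83, 708.16, 996.59),
    (4, 56.47, 956.11, 1137.55),
    (8, 90.28, 1361.85, 1900.15),
    (16, 52.97, 76766.84, 251245.22),
]


def pick_concurrency(rows, p99_ttft_budget_ms: float):
    """Return the concurrency with the best throughput within the TTFT budget."""
    eligible = [row for row in rows if row[3] <= p99_ttft_budget_ms]
    if not eligible:
        return None  # no tested setting meets the budget
    return max(eligible, key=lambda row: row[1])[0]


# Example with a hypothetical 2000 ms P99 TTFT budget.
print(pick_concurrency(SWEEP, 2000.0))
```

With a tighter budget the same rule selects a lower concurrency, which makes the latency and throughput trade-off explicit.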

9 Performance Tuning¶

9.1 Recommended Configurations¶

Note: The following configurations are validated in specific test environments and are for reference only. The optimal configuration depends on factors such as maximum input/output length, prefix cache hit rate, precision requirements, and deployment machine ratios. It is recommended to refer to Section 9.2 for tuning based on actual conditions.

Table 1: Scenario Overview¶

Scenario	Deployment Mode	Total NPUs	Weight Version	Key Considerations
Long Context	Single-node	16 (A3)	bfloat16	Keep --max-model-len close to the real maximum input and output length; reduce --max-num-seqs first when memory pressure is high.
Low Latency	Single-node	16 (A3)	bfloat16	Reduce --max-num-seqs and --max-num-batched-tokens to reduce queueing delay.
High Throughput	Single-node	16 (A3)	bfloat16	Increase --max-num-seqs gradually and benchmark with a request rate close to the real workload.

Table 2: Detailed Recommendations¶

Long context: use --tensor-parallel-size 16, keep --max-model-len close to the real maximum input and output length, and reduce --max-num-seqs first when memory pressure is high.
Low latency: reduce --max-num-seqs and --max-num-batched-tokens to reduce queueing delay.
High throughput: increase --max-num-seqs gradually and benchmark with a request rate close to the real workload. For long-context throughput tests, evaluate --decode-context-parallel-size as an optional tuning knob.
For 1024 input tokens and 1024 output tokens in the reference concurrency sweep, --max-concurrency 8 had the best output throughput. Higher concurrency increased TTFT significantly, so validate tail latency before using it in production.

Note:

--max-model-len and --max-num-seqs need to be set according to the actual usage scenario.

If the service runs under high concurrency, verify NPU health and HCCL status before increasing the request rate.

9.2 Tuning Guidelines¶

9.2.1 General Tuning Reference¶

Please refer to the Public Performance Tuning Documentation for general tuning methods.

Please refer to the Feature Guide for detailed feature descriptions.

10 FAQ¶

For common environment, installation, and general parameter issues, please refer to the Public FAQ; this chapter only covers model-specific issues.

Q: Why does the API return {"error":"Model not found"} or a 404 when the request uses "model": "Kimi-K2-Thinking"?

A: By default, the server registers the model under its full path, moonshotai/Kimi-K2-Thinking. If a request uses the short name Kimi-K2-Thinking and the server was started without a --served-model-name override, the server cannot resolve the model ID. Either use "model": "moonshotai/Kimi-K2-Thinking" in requests, or start the server with --served-model-name Kimi-K2-Thinking to enable the short name.
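To see which model IDs the server accepts, you can query the OpenAI-compatible /v1/models endpoint. The sketch below assumes the usual response shape with a top-level "data" list whose entries carry an "id" field; the function names are illustrative.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list:
    """Extract model IDs from a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1",
                    timeout_s: float = 10.0) -> list:
    """Query the running server and return the model IDs it serves."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return served_model_ids(json.load(resp))


# Example call (requires the running server):
# print(fetch_model_ids())
```

Use one of the returned IDs as the "model" value in your requests.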