Kimi-K2-Thinking#

1 Introduction#

Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.

This document walks through the main verification steps for the model: supported features, environment preparation, installation, online service deployment, functional verification, accuracy evaluation, performance evaluation, performance tuning, and FAQ.

We recommend following this document with the latest release candidate or official release.

2 Supported Features#

Refer to supported features to get the model’s supported feature matrix.

Refer to feature guide to get the feature’s configuration.

3 Prerequisites#

3.1 Model Weight#

We recommend downloading the model weights to a shared directory, such as /mnt/sfs_turbo/.cache/.

After downloading the model weights, edit config.json in the original model directory and change the value of "quantization_config.config_groups.group_0.targets" from ["Linear"] to ["MoE"] so that the quantized model can be verified:

{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
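The manual edit can also be scripted. The following is a minimal sketch, assuming config.json has the nested layout shown above; the helper name retarget_quantization is hypothetical and is not part of vLLM or vllm-ascend.

```python
import json
from pathlib import Path


def retarget_quantization(config_path, old=("Linear",), new=("MoE",)):
    """Rewrite group_0.targets in config.json; return True if the file changed.

    Hypothetical helper for illustration only.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    group = config["quantization_config"]["config_groups"]["group_0"]
    if group.get("targets") != list(old):
        # Already edited, or an unexpected value: leave the file untouched.
        return False
    group["targets"] = list(new)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

Calling retarget_quantization("/mnt/sfs_turbo/.cache/<model-dir>/config.json") a second time returns False and leaves the file as it is, so the script is safe to rerun. Note that it rewrites the file with two-space indentation.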

Your model files should look like:

.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
|-- tokenizer_config.json
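A quick way to confirm that a download is complete is to compare the directory against the listing above. This is a minimal sketch: the file names and the shard count of 62 are taken from that listing, while the function name missing_model_files is a hypothetical helper introduced here.

```python
from pathlib import Path

# File names taken from the listing above; the shard count (62) comes from the
# "-of-000062" suffix in that listing.
REQUIRED_FILES = [
    "chat_template.jinja",
    "config.json",
    "configuration_deepseek.py",
    "configuration.json",
    "generation_config.json",
    "model.safetensors.index.json",
    "modeling_deepseek.py",
    "tiktoken.model",
    "tokenization_kimi.py",
    "tokenizer_config.json",
]


def missing_model_files(model_dir, num_shards=62):
    """Return the sorted list of expected files that are absent from model_dir."""
    expected = list(REQUIRED_FILES)
    expected += [
        f"model-{i:05d}-of-{num_shards:06d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
    root = Path(model_dir)
    return sorted(name for name in expected if not (root / name).is_file())
```

An empty return value means every expected file is present; it does not check file sizes or checksums.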

4 Installation#

4.1 Docker Image Installation#

You can use the official Docker image to run Kimi-K2-Thinking directly.

Select an image based on your machine type and start the container on your node. For details, refer to using docker.

# Update the vllm-ascend image according to your environment.
export IMAGE=quay.io/ascend/vllm-ascend:v0.22.1rc1-a3
# Container name; the verification command below filters on this name.
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash

Parameter Descriptions:

  • IMAGE: specifies the vllm-ascend image. The -a3 suffix selects the Atlas A3 image.

  • NAME: specifies the container name.

  • --net=host: uses host networking, so the vLLM service port is exposed on the host directly.

  • --shm-size=1g: configures container shared memory.

  • --device /dev/davinci[0-15]: exposes 16 Ascend NPU devices to the container.

  • --device /dev/davinci_manager, --device /dev/devmm_svm, and --device /dev/hisi_hdc: expose required Ascend runtime device files.

  • -v /usr/local/dcmi:/usr/local/dcmi: mounts DCMI tools for device management.

  • -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi: mounts the NPU monitoring command.

  • -v /usr/local/Ascend/driver/*: mounts Ascend driver libraries and version files.

  • -v /etc/ascend_install.info:/etc/ascend_install.info: mounts Ascend installation metadata.

  • -v /mnt/sfs_turbo/.cache:/home/cache: mounts the shared model cache directory. Update it if you store model weights elsewhere.
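On nodes that expose a different number of NPUs, the repeated --device /dev/davinciN lines can be generated instead of typed. The sketch below only builds the argument list; the function name davinci_device_flags is hypothetical, and the /dev/davinciN naming is taken from the command above.

```python
def davinci_device_flags(count=16):
    """Build the '--device /dev/davinciN' arguments for N = 0 .. count - 1."""
    flags = []
    for index in range(count):
        flags += ["--device", f"/dev/davinci{index}"]
    return flags


if __name__ == "__main__":
    # Print the flags on one line so they can be pasted into a docker run command.
    print(" ".join(davinci_device_flags(16)))
```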

After the container starts, run the following command on the host to verify the container status:

docker ps --filter name=vllm-ascend --format "table {{.Names}}\t{{.Status}}"

Expected Status:

  • The container name is vllm-ascend.

  • The status is Up ....

  • The container does not exit immediately.

Run the following command in the container to verify that Ascend devices are visible:

npu-smi info

Expected Status:

  • The command exits successfully.

  • The output lists the expected NPU devices.

  • Device health status is normal.

4.2 Source Code Installation#

If you do not want to use the Docker image, you can also build from source:

# Install vLLM.
git clone --depth 1 --branch v0.22.1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install -e .
cd ..

# Install vLLM Ascend.
git clone --depth 1 --branch v0.22.1rc1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .

To verify the source installation, run:

python -c "import vllm; import vllm_ascend; print('vllm and vllm_ascend import ok')"

Expected Status:

  • The command exits successfully.

  • vllm and vllm_ascend import ok is printed.
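When comparing nodes in a multi-node setup, it can also help to record which versions are installed. This is a small sketch using only the standard library; it assumes the packages are registered under the distribution names vllm and vllm-ascend, and the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them if your install differs.
    for name in ("vllm", "vllm-ascend"):
        print(name, installed_version(name))
```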

If you want to deploy a multi-node environment, set up the same software environment on each node.

5 Online Service Deployment#

5.1 Single-Node Online Deployment#

Single-node deployment runs both prefill and decode on the same node. It suits online inference scenarios with moderate concurrency requirements.

For an Atlas 800 A3 (64G x 16) node, tensor-parallel-size should be at least 16.

Run the following script to start the vLLM server:

export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SERVER_PORT=8000

vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 16 \
  --port $SERVER_PORT \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 12 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enable-expert-parallel \
  --no-enable-prefix-caching

Parameter and Environment Variable Descriptions:

  • HCCL_BUFFSIZE=1024: configures the HCCL buffer size.

  • TASK_QUEUE_ENABLE=1: enables task queue scheduling.

  • OMP_PROC_BIND=false: avoids overly strict OpenMP CPU binding.

  • HCCL_OP_EXPANSION_MODE=AIV: enables the AIV communication path.

  • PYTORCH_NPU_ALLOC_CONF=expandable_segments:True: reduces NPU memory fragmentation.

  • SERVER_PORT: sets the service port, 8000 in this example.

  • --tensor-parallel-size 16: uses 16-way tensor parallelism on the A3 node.

  • --max-model-len 8192: sets the maximum model context length.

  • --max-num-batched-tokens 8192: sets the maximum number of batched tokens.

  • --max-num-seqs 12: sets the maximum number of concurrent sequences.

  • --gpu-memory-utilization 0.9: controls the memory ratio used by vLLM.

  • --trust-remote-code: allows loading model-specific remote code.

  • --enable-expert-parallel: enables expert parallelism for MoE layers.

  • --no-enable-prefix-caching: disables prefix caching for a stable baseline.

Service Verification:

After the service starts, you should see logs similar to:

INFO:     Started server process [...]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Expected Status:

  • The server process starts successfully.

  • No error logs related to HCCL or NPU initialization.

  • The service keeps running and does not exit shortly after startup.
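Loading the weights can take a while, so scripts that follow the server start usually need to wait until it is ready. The sketch below polls the server's /health endpoint with the standard library; the function name wait_for_server and the default timeout values are assumptions chosen for illustration.

```python
import http.client
import time
import urllib.request


def wait_for_server(base_url, timeout_s=600.0, interval_s=5.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, HTTP error, or the server is still starting up.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the deployment above, wait_for_server("http://localhost:8000") returns True once the service answers, and False if it has not become ready within the timeout.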

6 Functional Verification#

After the service is started, the model can be invoked by sending a prompt:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "moonshotai/Kimi-K2-Thinking",
  "messages": [
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 1.0
}'

Expected Result:

  • The HTTP status code is 200.

  • choices[0].message.content contains the generated assistant response.
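The same check can be run from Python. This is a minimal sketch using only the standard library, assuming the server from the previous section is listening on localhost:8000; the function names build_chat_payload, extract_content, and chat are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="moonshotai/Kimi-K2-Thinking", temperature=1.0):
    """Build the same JSON body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def extract_content(response_body):
    """Return choices[0].message.content from a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8000"):
    """Send one prompt to the server and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for non-2xx status codes.
    with urllib.request.urlopen(request) as resp:
        return extract_content(json.load(resp))
```

With the service running, print(chat("Who are you?")) prints the generated response.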

7 Accuracy Evaluation#

Using AISBench#

For details, please refer to Using AISBench.

Using lm-eval#

You can use lm-eval to evaluate the model accuracy through the OpenAI-compatible API.

For lm_eval installation, please refer to Using lm_eval.

Run lm_eval to execute the accuracy evaluation:

lm_eval \
  --model local-completions \
  --model_args model=moonshotai/Kimi-K2-Thinking,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
  --tasks gsm8k \
  --output_path ./

Reference configuration: gsm8k (5-shot), --apply_chat_template, --fewshot_as_multiturn, greedy decoding (temperature=0.0, top_p=1.0), max 2048 output tokens, batch size 1.

Below are reference gsm8k results for Kimi-K2-Thinking powered by vllm-ascend:v0.20.2rc1, evaluated on one Atlas 800 A3 node (64G × 16).

| task | version | filter | n-shot | metric | value | stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.8992 | 0.0083 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.8453 | 0.0100 |
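When comparing your own run against these reference values, the stderr column gives a sense of how much run-to-run difference is within noise. The sketch below turns a value and its standard error into an approximate 95% confidence interval; the normal approximation (z = 1.96) is an assumption made here, not something stated by the evaluation output.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate confidence interval under a normal approximation."""
    margin = z * stderr
    return value - margin, value + margin


if __name__ == "__main__":
    for name, value, stderr in [
        ("flexible-extract", 0.8992, 0.0083),
        ("strict-match", 0.8453, 0.0100),
    ]:
        low, high = confidence_interval(value, stderr)
        print(f"{name}: {low:.4f} to {high:.4f}")
```

Under this approximation the flexible-extract score lies roughly between 0.883 and 0.915, and the strict-match score roughly between 0.826 and 0.865.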

8 Performance Evaluation#

Refer to vllm benchmark for more details.

Test Command Example:

vllm bench serve \
  --backend openai-chat \
  --model moonshotai/Kimi-K2-Thinking \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --request-rate 1

After the benchmark completes, it reports the performance results, including request throughput, output token throughput, time to first token (TTFT), time per output token (TPOT), and inter-token latency (ITL).
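To make the three latency metrics concrete, the sketch below computes them for a single streamed request from its token arrival times. It assumes the commonly used definitions (TTFT as the delay to the first token, TPOT as decode time averaged over the tokens after the first, ITL as the gaps between consecutive tokens); the benchmark tool's exact formulas may differ in detail, and the function name latency_metrics is hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT, and ITL (in the unit of the inputs) for one request."""
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    ttft = token_times[0] - request_start
    # Gaps between consecutive streamed tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # Decode time averaged over the tokens that follow the first one.
    decode_tokens = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / decode_tokens if decode_tokens else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}
```

For a request sent at t = 0.0 s whose tokens arrive at 0.5, 0.7, 0.9, and 1.1 s, TTFT is 0.5 s and both TPOT and every ITL gap are about 0.2 s.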

The following reference results are obtained with vllm-ascend:v0.20.2rc1 on one Atlas 800 A3 node (64G × 16), using OpenAI chat serving, random input/output lengths, 10 prompts, and --request-rate 1:

| random input len | random output len | success | duration (s) | request throughput (req/s) | output throughput (tok/s) | total throughput (tok/s) | mean TTFT (ms) | mean TPOT (ms) | mean ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 512 | 10 / 10 | 111.00 | 0.09 | 46.12 | 94.38 | 507.60 | 200.47 | 200.08 |
| 1024 | 1024 | 10 / 10 | 221.52 | 0.05 | 46.23 | 93.48 | 566.39 | 208.20 | 208.00 |
| 2048 | 2048 | 10 / 10 | 479.72 | 0.02 | 42.69 | 85.78 | 722.32 | 230.26 | 230.15 |
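The output throughput column can be cross-checked from the other columns: it is approximately the number of generated tokens divided by the benchmark duration. The sketch below assumes every request produces exactly the requested output length, which is why the result only matches the reported values to within rounding; the function name output_throughput is hypothetical.

```python
def output_throughput(num_prompts, output_len, duration_s):
    """Generated tokens per second, assuming each request emits output_len tokens."""
    return num_prompts * output_len / duration_s


if __name__ == "__main__":
    # (output length, duration in seconds) pairs from the reference results.
    for output_len, duration_s in [(512, 111.00), (1024, 221.52), (2048, 479.72)]:
        print(output_len, round(output_throughput(10, output_len, duration_s), 2))
```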

For a concurrency sweep, keep the input and output length fixed and vary --max-concurrency:

MODEL_NAME=moonshotai/Kimi-K2-Thinking
INPUT_LEN=1024
OUTPUT_LEN=1024

for CONCURRENCY in 1 2 4 8 16 32; do
  NUM_PROMPTS=$((CONCURRENCY * 10))
  vllm bench serve \
    --backend openai-chat \
    --model "$MODEL_NAME" \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --random-input-len "$INPUT_LEN" \
    --random-output-len "$OUTPUT_LEN" \
    --num-prompts "$NUM_PROMPTS" \
    --request-rate inf \
    --max-concurrency "$CONCURRENCY"
done

Reference results for 1024 input tokens and 1024 output tokens are:

| max concurrency | prompts | success | duration (s) | request throughput (req/s) | output throughput (tok/s) | total throughput (tok/s) | mean TTFT (ms) | P99 TTFT (ms) | mean TPOT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 10 / 10 | 595.07 | 0.02 | 17.21 | 34.80 | 473.71 | 712.49 | 57.71 |
| 2 | 20 | 20 / 20 | 623.88 | 0.03 | 32.83 | 66.35 | 708.16 | 996.59 | 60.29 |
| 4 | 40 | 40 / 40 | 725.38 | 0.06 | 56.47 | 114.13 | 956.11 | 1137.55 | 69.97 |
| 8 | 80 | 80 / 80 | 907.44 | 0.09 | 90.28 | 182.43 | 1361.85 | 1900.15 | 87.37 |
| 16 | 160 | 160 / 160 | 3093.07 | 0.05 | 52.97 | 107.04 | 76766.84 | 251245.22 | 222.07 |

Note: At a concurrency of 16, the mean TTFT rises sharply to about 76.8 s (76766.84 ms), which indicates severe queueing delay. The example server configuration above sets --max-num-seqs 12, so with that setting requests beyond the first 12 have to wait for a free slot. For production deployment, limit concurrency according to your latency requirements, or increase --max-num-seqs and --max-num-batched-tokens if NPU memory allows.
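The reference results are consistent with a simple latency model: each request takes roughly the mean TTFT plus the mean TPOT for every token after the first, and with a fixed concurrency cap the requests finish in waves. The sketch below is a back-of-the-envelope estimate under those assumptions, not the benchmark tool's own calculation; the function name estimate_duration_s is hypothetical.

```python
import math


def estimate_duration_s(mean_ttft_ms, mean_tpot_ms, output_len, num_prompts, concurrency):
    """Estimate total benchmark duration in seconds from mean latency metrics."""
    # One request: time to first token plus decode time for the remaining tokens.
    request_latency_ms = mean_ttft_ms + mean_tpot_ms * (output_len - 1)
    # With a fixed concurrency cap, requests complete in roughly equal waves.
    waves = math.ceil(num_prompts / concurrency)
    return waves * request_latency_ms / 1000.0
```

For concurrency 1 the estimate is about 595.1 s against a measured 595.07 s, and for concurrency 8 about 907.4 s against 907.44 s. For concurrency 16 it gives roughly 3039 s against a measured 3093.07 s, so the model is looser once queueing dominates.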

9 Performance Tuning#

9.2 Tuning Guidelines#

9.2.1 General Tuning Reference#

Please refer to the Public Performance Tuning Documentation for general tuning methods.

Please refer to the Feature Guide for detailed feature descriptions.

10 FAQ#

For common environment, installation, and general parameter issues, please refer to the Public FAQ; this chapter only covers model-specific issues.

  • Q: Why does the API return {"error":"Model not found"} or 404 when the request uses model: "Kimi-K2-Thinking"?

    A: By default, the server registers the model under its full path, moonshotai/Kimi-K2-Thinking. When a request uses the short name Kimi-K2-Thinking and the server was started without a --served-model-name override, the server cannot resolve the model ID. Use "model": "moonshotai/Kimi-K2-Thinking" in requests, or start the server with --served-model-name Kimi-K2-Thinking to enable the short name.
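To see which model IDs a running server accepts, a client can inspect the model list returned by the OpenAI-compatible GET /v1/models endpoint. The sketch below works on the parsed JSON body and assumes the usual response shape, with a data list whose entries carry an id field; the function names served_model_ids and resolve_model_name are hypothetical.

```python
def served_model_ids(models_response):
    """Extract the model IDs from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def resolve_model_name(requested, models_response):
    """Map a requested name to a served model ID, or return None if unresolved."""
    ids = served_model_ids(models_response)
    if requested in ids:
        return requested
    # Fall back to a full ID whose last path component equals the short name.
    matches = [model_id for model_id in ids if model_id.rsplit("/", 1)[-1] == requested]
    return matches[0] if len(matches) == 1 else None
```

With a server started as shown earlier, resolve_model_name("Kimi-K2-Thinking", body) returns the full ID to put in the request's model field, where body is the parsed /v1/models response.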