MiniMax-M2.5#

Introduction#

MiniMax‑M2.5 is MiniMax’s flagship large language model, reinforced for high‑value scenarios such as code generation, agentic tool calling/search, and complex office workflows, with an emphasis on reasoning efficiency and end‑to‑end speed on challenging tasks.

This document provides a unified deployment guide for MiniMax-M2.5 on vLLM Ascend, covering both:

  • A3 single-node deployment (Atlas 800 A3)

  • A2 single-node deployment (Atlas 800I A2)

Supported Features#

Refer to supported features for the model's feature support matrix.

Refer to the feature guide for how to configure each feature.

Environment Preparation#

Model Weights#

It is recommended to download the model weights to a shared directory, such as /mnt/sfs_turbo/.cache/. The current release automatically detects the MiniMax-M2 fp8 checkpoint, disables fp8 quantization kernels on NPU, and loads the weights by dequantizing to bf16. This behavior may be removed once public bf16 weights are available.

Installation#

You can use the official docker image to run MiniMax-M2.5 directly.

Select an image based on your machine type and start the container on your node. See using docker.

Run with Docker#

A3 (single node)#

# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.21.0rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you run Docker with a bridge network, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
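
The sixteen `--device /dev/davinciN` flags above follow a simple pattern, so they can be generated rather than typed by hand when the NPU count differs. The following is a minimal sketch; the helper name `davinci_device_flags` is hypothetical, and it assumes the devices are numbered contiguously from 0, as in the command above.

```python
def davinci_device_flags(num_npus: int) -> list[str]:
    """Return one `--device /dev/davinciN` flag per NPU, numbered from 0."""
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    return [f"--device /dev/davinci{i}" for i in range(num_npus)]


if __name__ == "__main__":
    # 16 matches the A3 example above (davinci0 .. davinci15).
    # Each flag is printed with a trailing backslash so the output can be
    # pasted into a multi-line docker run command.
    print(" \\\n".join(davinci_device_flags(16)) + " \\")
```

Check the actual device nodes under /dev on your host before relying on the generated list.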

A2 (single node)#

Create and run minimax25-docker-run.sh.

Notes:

  • The default configuration assumes an Atlas 800I A2 8-NPU node and sets ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. Update it based on your hardware.

  • Map your model weight directory into the container (the example maps it to /opt/data/verification/).

#!/bin/sh
NAME=minimax2_5
DEVICES="0,1,2,3,4,5,6,7"
IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|

docker run -itd -u 0 --ipc=host --privileged \
  -e VLLM_USE_MODELSCOPE=True \
  -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
  -e ASCEND_RT_VISIBLE_DEVICES=$DEVICES \
  --name $NAME \
  --net=host \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  --shm-size=1200g \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /home/:/home/ \
  -v /opt/data/verification/:/opt/data/verification/ \
  -v /root/.cache:/root/.cache \
  -v /mnt/performance/:/mnt/performance/ \
  -it $IMAGE bash

# Start and enter the container
# bash minimax25-docker-run.sh
# docker exec -it minimax2_5 bash

Online Inference on Multi-NPU#

A3 (single node)#

Below is a recommended startup configuration for good performance on short-context workloads such as 3.5k/1.5k.

Notes:

  • If you only care about short-context low latency, you can explicitly set --max-model-len 32768. You may also set --tensor-parallel-size to 16 and --data-parallel-size to 1.

  • export VLLM_ASCEND_BALANCE_SCHEDULING=1 improves scheduling balance between prefill and decode. Its effect is most pronounced with a larger data-parallel-size, and it can increase performance when concurrency approaches data-parallel-size × max-num-seqs.

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_NUM_THREADS=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1

export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1

vllm serve /path/to/weight/MiniMax-M2.5-w8a8-QuaRot \
    --served-model-name "MiniMax-M2.5" \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --quantization ascend \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config '{"enable_cpu_binding":true}' \
    --enable-expert-parallel \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --max-num-seqs 48 \
    --max-model-len 40690 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 \
    --speculative_config '{"enforce_eager": true, "method": "eagle3", "model": "/path/to/weight/Eagle3/", "num_speculative_tokens": 3}'
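
The balance-scheduling note refers to the concurrency level data-parallel-size × max-num-seqs. As a small worked example, the sketch below computes that value for the configuration above; the function name `saturation_concurrency` is hypothetical and only restates the product named in the note.

```python
def saturation_concurrency(data_parallel_size: int, max_num_seqs: int) -> int:
    """Concurrency at which every data-parallel replica has all sequence slots in use."""
    return data_parallel_size * max_num_seqs


# Values taken from the A3 startup command above.
print(saturation_concurrency(data_parallel_size=4, max_num_seqs=48))  # 192
```

With the alternative --tensor-parallel-size 16 and --data-parallel-size 1 layout and the same --max-num-seqs, the same product is 48.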

Remarks:

  • The minimax_m2_append_think reasoning parser keeps <think>...</think> inside content.

  • If you mainly rely on the reasoning semantics of /v1/responses, use --reasoning-parser minimax_m2 instead.

  • For better performance on long contexts such as 64k or 128k, apply the changes shown below. You can also remove export VLLM_ASCEND_BALANCE_SCHEDULING=1.

    --tensor-parallel-size 8 \
    --data-parallel-size 1 \
    --decode-context-parallel-size 1 \
    --prefill-context-parallel-size 2 \
    --cp-kv-cache-interleave-size 128 \
    --max-num-seqs 16 \
    --max-model-len 138000 \
    --max-num-batched-tokens 65536 \
    --gpu-memory-utilization 0.85 \
    --speculative_config '{"enforce_eager": true, "method": "eagle3", "model": "/path/to/weight/Eagle3/", "num_speculative_tokens": 1}' \

  • To test with the curl command, append the following options to the startup command above.

    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
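
As a sanity check on the A3 layouts in this section, the sketch below compares each parallel configuration with the 16 NPUs mapped into the container. It assumes the number of NPUs used is tensor-parallel-size × data-parallel-size × prefill-context-parallel-size and that decode context parallelism adds no devices; that formula is an assumption of this example, not something this guide states, and `npus_required` is a hypothetical helper.

```python
def npus_required(tp: int, dp: int = 1, pcp: int = 1) -> int:
    """NPUs used, ASSUMING world size = tp * dp * pcp (see lead-in)."""
    return tp * dp * pcp


A3_NPUS = 16  # davinci0 .. davinci15 in the A3 docker command

configs = {
    "short-context (TP 4, DP 4)": dict(tp=4, dp=4),
    "low-latency (TP 16, DP 1)": dict(tp=16, dp=1),
    "long-context (TP 8, DP 1, PCP 2)": dict(tp=8, dp=1, pcp=2),
}

for name, cfg in configs.items():
    print(f"{name}: {npus_required(**cfg)} of {A3_NPUS} NPUs")
```

Under that assumption, all three layouts use exactly the 16 devices exposed to the container.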

A2 (single node)#

export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=512
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl kernel.sched_migration_cost_ns=50000
export TASK_QUEUE_ENABLE=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1

vllm serve /path/to/weight/MiniMax-M2.5-w8a8-QuaRot \
    --served-model-name MiniMax-M2.5 \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --quantization ascend \
    --enable-expert-parallel \
    --max-num-seqs 32 \
    --seed 1024 \
    --max-num-batched-tokens 32768 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY","cudagraph_capture_sizes":[4,16,20,32,80,96,128,200,256,320]}' \
    --gpu-memory-utilization 0.9 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enable-force-include-usage \
    --additional-config '{"enable_cpu_binding":true}' \
    --model-loader-extra-config '{"enable_multithread_load":true,"num_threads":16}' \
    --speculative_config '{"method": "eagle3", "model": "/path/to/weight/Eagle3/",  "num_speculative_tokens":3}'

Remarks:

  • The --max-num-seqs parameter can be adjusted to match actual request load.

  • --max-num-batched-tokens 32768 suits input sequence lengths of 32k or longer.

  • --max-num-batched-tokens 16384 suits an input sequence length of 16k.

  • --max-num-batched-tokens 6144 suits short-input scenarios such as 2k and 3.5k.
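
The remarks above can be expressed as a small lookup. This is a sketch only: the remarks name 2k, 3.5k, 16k, and 32k or longer, so the cut-off points used for lengths in between (4096 and 16384) are assumptions of this example, and `suggest_max_num_batched_tokens` is a hypothetical helper rather than part of vLLM.

```python
def suggest_max_num_batched_tokens(input_len: int) -> int:
    """Map a typical input length (in tokens) to a --max-num-batched-tokens value."""
    if input_len <= 0:
        raise ValueError("input_len must be positive")
    if input_len <= 4096:      # short inputs such as 2k and 3.5k (boundary assumed)
        return 6144
    if input_len <= 16384:     # inputs around 16k (boundary assumed)
        return 16384
    return 32768               # longer inputs, including 32k and above


for length in (2048, 3584, 16384, 32768):
    print(length, "->", suggest_max_num_batched_tokens(length))
```

Treat the result as a starting point and tune it against measured throughput and memory headroom.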

Verify the Service#

A3 (single node)#

Test with an OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="na")

resp = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[{"role": "user", "content": "Hello, please introduce yourself and show the parameter format of a tool call."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)

Or send a request using curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Please look up the weather in Shanghai."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get weather by city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
          },
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 0,
    "max_tokens": 512
  }'
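
When the server is started with --enable-auto-tool-choice and --tool-call-parser minimax_m2, a request like the one above is expected to come back with tool calls in the standard OpenAI chat-completion layout. The sketch below shows one way to read them from the parsed JSON. The `sample` dictionary is hand-written for illustration and is not captured model output, and `extract_tool_calls` is a hypothetical helper.

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function_name, arguments) pairs from a chat-completion response."""
    message = response["choices"][0]["message"]
    calls = []
    # tool_calls may be missing or null when the model answers in plain text.
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # The arguments field is a JSON-encoded string, not a nested object.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written sample in the OpenAI chat-completion shape (illustrative only).
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"city\": \"Shanghai\"}",
                },
            }],
        },
        "finish_reason": "tool_calls",
    }]
}

print(extract_tool_calls(sample))
```

An empty list means the model replied with ordinary content instead of calling a tool.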

A2 (single node)#

Run the following from any machine that can reach the service node (replace {NodeIP} with the real IP):

curl http://{NodeIP}:8000/v1/chat/completions \
  -H "Content-type: application/json" \
  -d '{
    "model": "MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "stream": false,
    "ignore_eos": true,
    "temperature": 0.8,
    "top_p": 0.8,
    "max_tokens": 200
  }'
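
Model loading can take a while, so it may help to wait until the server answers before sending the request above. The sketch below polls the server's /health endpoint using only the Python standard library. The names `http_ok` and `wait_until_ready` and the default timeout and interval are assumptions of this example; replace {NodeIP} as in the curl command.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx replies all count as not ready.
        return False


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll `{base_url}/health` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example usage (not executed here):
#   ready = wait_until_ready("http://{NodeIP}:8000")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.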

FAQ#

  • Q: What should I do if the output is garbled in EP mode?

    A: It is recommended to keep --enable-expert-parallel and VLLM_ASCEND_ENABLE_FLASHCOMM1=1.

  • Q: Why is the reasoning field often empty after using minimax_m2_append_think?

    A: This is expected. The parser keeps <think>...</think> inside content. If you mainly rely on the reasoning semantics of /v1/responses, use --reasoning-parser minimax_m2 instead.

  • Q: Startup fails with HCCL port conflicts (address already bound). What should I do?

    A: Clean up old processes and restart, for example: pkill -f "vllm serve /models/MiniMax-M2.5". Adjust the pattern to match the model path used in your own vllm serve command.

  • Q: How to handle OOM or unstable startup?

    A: Reduce --max-num-seqs and --max-num-batched-tokens first. If needed, reduce concurrency and load-testing pressure (e.g., max-concurrency / num-prompts).

  • Q: How should I choose --reasoning-parser?

    A: This guide uses minimax_m2_append_think so that <think>...</think> is kept in content. If you mainly rely on the reasoning semantics of /v1/responses, consider using --reasoning-parser minimax_m2.

  • Q: Which ports must be accessible?

    A: At minimum, expose the serving port (e.g., 8000).
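
As the reasoning-parser answers above explain, minimax_m2_append_think leaves <think>...</think> inside content. If a client wants the reasoning and the final answer separately, it can split them itself. The following is a minimal client-side sketch; `split_think` is a hypothetical helper, and it assumes at most one <think>...</think> block per message.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured;
# DOTALL lets the reasoning span multiple lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from content that may embed a <think> block."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


print(split_think("<think>User greets me.</think>\nHello!"))
# ('User greets me.', 'Hello!')
```

If the server is started with --reasoning-parser minimax_m2 instead, this client-side step should not be needed.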