KV Cache Pool (Ascend Store) Deployment Guide#
Environment Dependencies#
Software requirements:
CANN >= 8.5.0
vLLM: main branch
vLLM-Ascend: main branch
mooncake: >= 0.3.9
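KV Pool Parameters#
kv_load_failure_policy: KV load failure handling policy#
kv_load_failure_policy is a top-level field of kv-transfer-config.
recompute: when a KV load fails, vLLM rolls the request back to the last valid prefix and reschedules it so that the failed KV blocks are recomputed. Hybrid attention models (for example DeepSeekV4 and Qwen 3.5) are not yet supported.
fail: when a KV load fails, the affected request is terminated immediately and an error is returned.
The default in vLLM is fail. To make requests fall back to recomputation after a KV load failure, set it to recompute.
When using MultiConnector, configure kv_load_failure_policy on the top-level MultiConnector kv-transfer-config, not on the child connectors.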
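The placement rule for MultiConnector can be sketched as a small helper that builds the kv-transfer-config and keeps kv_load_failure_policy on the parent only. This is a minimal sketch: build_multi_connector_config is a hypothetical helper name, not part of vLLM, and the connector fields are copied from the examples later in this guide.

```python
import json


def build_multi_connector_config(kv_role, child_connectors, failure_policy="recompute"):
    """Build a kv-transfer-config dict with the failure policy at the top level.

    Hypothetical helper for illustration; vLLM only sees the resulting JSON.
    """
    if failure_policy not in ("recompute", "fail"):
        raise ValueError("kv_load_failure_policy must be 'recompute' or 'fail'")
    children = []
    for child in child_connectors:
        child = dict(child)
        # The policy belongs on the MultiConnector parent, so drop it from children.
        child.pop("kv_load_failure_policy", None)
        child.setdefault("kv_role", kv_role)
        children.append(child)
    return {
        "kv_connector": "MultiConnector",
        "kv_role": kv_role,
        "kv_load_failure_policy": failure_policy,
        "kv_connector_extra_config": {"connectors": children},
    }


config = build_multi_connector_config(
    "kv_producer",
    [
        {"kv_connector": "MooncakeConnectorV1", "kv_port": "20001"},
        {
            "kv_connector": "AscendStoreConnector",
            "kv_load_failure_policy": "recompute",  # misplaced on purpose; removed above
            "kv_connector_extra_config": {"lookup_rpc_port": "0", "backend": "mooncake"},
        },
    ],
)
print(json.dumps(config, indent=2))
```

The printed JSON can be passed as the value of --kv-transfer-config.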
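kv_connector_extra_config: additional configurable parameters for pooling#
| Parameter | Description |
|---|---|
| lookup_rpc_port | Port for RPC communication between the pooling scheduler process and the worker processes. Each instance needs a unique port. |
| load_async | Whether to enable asynchronous loading. Defaults to false. |
| backend | Storage backend of the KV pool. The examples in this guide use mooncake, memcache, and yuanrong. |
| consumer_is_to_put | Whether decode nodes put KV cache into the KV pool. Defaults to false. |
| | Whether decode nodes load KV cache from the KV pool. Defaults to false. |
| prefill_pp_size | Prefill PP size. Required when the prefill nodes enable PP. |
| prefill_pp_layer_partition | Prefill PP layer partition. Required when the prefill nodes enable PP. |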
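Environment Variable Configuration#
To keep the generated hash values consistent, the PYTHONHASHSEED environment variable must be set to the same value on all nodes whenever the KV pool is enabled.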
export PYTHONHASHSEED=0
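Why the seed matters can be illustrated with Python's built-in hash(), which is randomized per process for strings unless PYTHONHASHSEED is fixed. The sketch below assumes that behavior is what the requirement protects against; hash_in_subprocess is a hypothetical helper, and the key string is made up.

```python
import os
import subprocess
import sys


def hash_in_subprocess(text, seed):
    """Run hash(text) in a fresh interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two independent processes agree only because both use the same fixed seed.
first = hash_in_subprocess("kv-block-0", 0)
second = hash_in_subprocess("kv-block-0", 0)
print(first == second)
```

With the same seed the two processes report the same value, which is the property every node in the pool needs.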
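Example: Using Mooncake as the KV Pool Backend#
Software requirements:
Check the NPU HCCN configuration:
Make sure the hccn.conf file exists in the environment. If you use Docker, mount it into the container.
cat /etc/hccn.conf
Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and build guide: kvcache-ai/Mooncake. First, obtain the Mooncake project with the following command: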
git clone -b v0.3.9 --depth 1 https://github.com/kvcache-ai/Mooncake.git
(Optional) If your network connection is poor, replace the Go download URL:
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
Install MPI:
apt-get install mpich libmpich-dev -y
Install the dependencies. Installing Go is not required.
bash dependencies.sh -y
Build and install:
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
Set environment variables
Note:
Adjust the Python path to match your Python installation.
Make sure /usr/local/lib and /usr/local/lib64 are included in your LD_LIBRARY_PATH.
export LD_LIBRARY_PATH=/usr/local/lib64/python3.12/site-packages/mooncake:$LD_LIBRARY_PATH
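Environment Variable Reference#
| Hardware | Dependency | Export command | Description |
|---|---|---|---|
| 800 I/T A3 series | HDK >= 26.0 | export ASCEND_ENABLE_USE_FABRIC_MEM=1 | Recommended. Enables the unified memory address direct transmission scheme. With SSD offload, memory sizes must be aligned to 1 GB; see Fabric memory size alignment. |
| 800 I/T A3 series | If any of the above dependencies is not met | | Configures the number and size of the buffers on the NPU device that are used for aggregation and KV transfer. |
| 800 I/T A2 series | HDK >= 25.5 recommended | export HCCL_INTRA_ROCE_ENABLE=1 | Required by the 800 I/T A2 series direct transmission scheme. |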
运行 Mooncake Master#
Note: Before proceeding, review the following Mooncake guides:
1. Configure mooncake.json#
The MOONCAKE_CONFIG_PATH environment variable must be set to the full path of mooncake.json.
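{
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"master_server_address": "xx.xx.xx.xx:50088",
"global_segment_size": "1GB",
"preferred_segment": false,
"prefer_alloc_in_same_node": true
}
metadata_server: set to P2PHANDSHAKE.
protocol: must be set to "ascend" on NPU.
device_name: leave as "".
master_server_address: IP address and port of the master service. It can also be set through the MOONCAKE_MASTER environment variable, which takes precedence over this field (useful when the master address is injected through Kubernetes).
global_segment_size: memory size that each card registers with the KV pool. It must be aligned to 1 GB. The value "1GB" can equivalently be written as "1024MB", "1048576KB", "1073741824B", or 1073741824. It can also be set through the MOONCAKE_GLOBAL_SEGMENT_SIZE environment variable, which takes precedence over this field.
preferred_segment: whether objects put into the KV pool are stored preferentially on the local segment. Defaults to false.
prefer_alloc_in_same_node: whether KV is allocated preferentially on the same node. Defaults to true.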
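A mooncake.json that respects the 1 GB alignment rule can be generated rather than hand-edited. The following is a sketch under stated assumptions: parse_size and build_mooncake_config are hypothetical helpers, size suffixes are treated as binary units (consistent with 1GB = 1073741824 above), and 192.0.2.10 is a documentation placeholder address.

```python
import json

GIB = 1024 ** 3
_UNITS = {"GB": 1024 ** 3, "MB": 1024 ** 2, "KB": 1024, "B": 1}


def parse_size(value):
    """Convert "1GB", "1024MB", "1048576KB", "1073741824B" or a bare number to bytes."""
    text = str(value).strip().upper()
    for suffix in ("GB", "MB", "KB", "B"):  # longest suffixes first
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * _UNITS[suffix]
    return int(text)


def build_mooncake_config(master_address, segment_size="1GB"):
    """Return the mooncake.json content, rejecting sizes that are not 1 GB aligned."""
    if parse_size(segment_size) % GIB != 0:
        raise ValueError("global_segment_size must be a multiple of 1 GB")
    return {
        "metadata_server": "P2PHANDSHAKE",
        "protocol": "ascend",
        "device_name": "",
        "master_server_address": master_address,
        "global_segment_size": segment_size,
        "preferred_segment": False,
        "prefer_alloc_in_same_node": True,
    }


config_text = json.dumps(build_mooncake_config("192.0.2.10:50088", "4GB"), indent=2)
print(config_text)
```

Write config_text to the file that MOONCAKE_CONFIG_PATH points to.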
2. Start mooncake_master#
In the mooncake folder:
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1 --default_kv_lease_ttl 11000
eviction_high_watermark_ratio determines the watermark where Mooncake Store will perform eviction, and eviction_ratio determines the portion of stored objects that would be evicted.
default_kv_lease_ttl controls the default lease TTL for KV objects (milliseconds); configure it via --default_kv_lease_ttl and keep it larger than ASCEND_CONNECT_TIMEOUT and ASCEND_TRANSFER_TIMEOUT.
PD Disaggregation (Prefill-Decode Disaggregation) Scenario#
1. Run the prefill node and the decode node#
Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 handles kv_transfer (KV transfer), while AscendStoreConnector serves as the prefix-cache node.
Prefill node:
bash multi_producer.sh
Contents of multi_producer.sh:
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
#Minimum retransmission timeout of the RDMA, equals 4.096 μs * 2 ^ timeout.
#Needs to satisfy the equation: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retry for RDMA transfer.
#HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
export HCCL_RDMA_TIMEOUT=17
# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
# This ensures that even in the worst case, where all Decode cards simultaneously attempt to connect to the same Prefill card, the connection will not time out.
export ASCEND_CONNECT_TIMEOUT=10000
# Unit: ms. The timeout for one-sided communication transfer is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039).
export ASCEND_TRANSFER_TIMEOUT=10000
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 32768 \
--block-size 128 \
--max-num-batched-tokens 16384 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_load_failure_policy": "recompute",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "20001",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"lookup_rpc_port":"0",
"backend": "mooncake"
}
}
]
}
}'
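The timeout relationships stated in the script comments and in the mooncake_master section can be checked together before launch. This is a sketch, not part of vLLM or Mooncake: the function names are hypothetical, and the constants (4.096 μs base, 7 retries, 500 ms per card) are taken from the comments above.

```python
RDMA_BASE_US = 4.096  # base unit from the script comment, in microseconds
RDMA_RETRIES = 7      # default RDMA retry count from the script comment


def rdma_timeout_ms(hccl_rdma_timeout):
    """Minimum RDMA retransmission timeout: 4.096 us * 2 ** timeout, in ms."""
    return RDMA_BASE_US * (2 ** hccl_rdma_timeout) / 1000.0


def recommended_connect_timeout_ms(decode_cards, per_card_ms=500):
    """Worst case: every decode card connects to the same prefill card."""
    return per_card_ms * decode_cards


def check_timeouts(hccl_rdma_timeout, connect_ms, transfer_ms, lease_ttl_ms, decode_cards):
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    if transfer_ms <= rdma_timeout_ms(hccl_rdma_timeout) * RDMA_RETRIES:
        problems.append("ASCEND_TRANSFER_TIMEOUT must exceed RDMA timeout * 7")
    if connect_ms < recommended_connect_timeout_ms(decode_cards):
        problems.append("ASCEND_CONNECT_TIMEOUT is below 500 ms * decode cards")
    if lease_ttl_ms <= max(connect_ms, transfer_ms):
        problems.append("default_kv_lease_ttl must exceed both Ascend timeouts")
    return problems


# Values from this guide: HCCL_RDMA_TIMEOUT=17, both timeouts 10000 ms, lease TTL 11000 ms.
print(round(rdma_timeout_ms(17), 3))
print(check_timeouts(17, 10000, 10000, 11000, decode_cards=4))
```

With HCCL_RDMA_TIMEOUT=17 the RDMA timeout is about 537 ms, so seven retries take roughly 3.8 s, which stays below the 10 s transfer timeout.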
Decode node:
bash multi_consumer.sh
Contents of multi_consumer.sh:
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export PYTHONHASHSEED=0
export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8200 \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 32768 \
--block-size 128 \
--max-num-batched-tokens 16384 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_load_failure_policy": "recompute",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "20002",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"lookup_rpc_port":"0",
"backend": "mooncake"
}
}
]
}
}'
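Currently, in the PD disaggregation scenario the KV pool by default stores only the KV cache generated by prefill nodes. For models that use MLA, decode nodes can now also store KV cache for prefill nodes to use; enable this by adding consumer_is_to_put: true to the AscendStoreConnector configuration. If the prefill nodes enable pipeline parallelism (PP), you must also set prefill_pp_size or prefill_pp_layer_partition. For example: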
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
"kv_load_failure_policy": "recompute",
"kv_connector_extra_config": {
"lookup_rpc_port": "0",
"backend": "mooncake",
"consumer_is_to_put": true,
"prefill_pp_size": 2,
"prefill_pp_layer_partition": "30,31"
}
}
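The extra config for this case can also be assembled programmatically. This is a minimal sketch: ascend_store_extra_config is a hypothetical helper, and deriving prefill_pp_size from the number of partition entries is an assumption that merely matches the example above (two entries, PP size 2).

```python
def ascend_store_extra_config(lookup_rpc_port, backend="mooncake",
                              consumer_is_to_put=False,
                              prefill_pp_layer_partition=None):
    """Build kv_connector_extra_config for AscendStoreConnector (illustrative only)."""
    extra = {"lookup_rpc_port": str(lookup_rpc_port), "backend": backend}
    if consumer_is_to_put:
        extra["consumer_is_to_put"] = True
    if prefill_pp_layer_partition:
        layers = [int(n) for n in prefill_pp_layer_partition]
        # Assumption: one partition entry per prefill PP stage.
        extra["prefill_pp_size"] = len(layers)
        extra["prefill_pp_layer_partition"] = ",".join(str(n) for n in layers)
    return extra


extra = ascend_store_extra_config(
    0, consumer_is_to_put=True, prefill_pp_layer_partition=[30, 31]
)
print(extra)
```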
2. Start the proxy server#
python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
--host localhost \
--prefiller-hosts localhost \
--prefiller-ports 8100 \
--decoder-hosts localhost \
--decoder-ports 8200
Replace localhost with your actual IP address.
3. Run Inference#
Set localhost, the port, and the model weight path in the commands to match your own setup.
Short question:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
Long question:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
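The same request can be issued from Python using only the standard library. This is a sketch that mirrors the curl payload above; build_completion_request and send are hypothetical helper names, and the URL and model path are the placeholders used in this guide.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_completion_tokens=200):
    """Build a POST request equivalent to the curl commands above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_completion_tokens": max_completion_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request(
    "http://localhost:8000",
    "/xxxxx/Qwen2.5-7B-Instruct",
    "Hello. I have a question. The president of the United States is",
)
# result = send(request)  # uncomment once the proxy server is running
```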
PD Mixed Deployment Inference#
1. Run the mixed deployment script#
bash pd_mix.sh
Contents of pd_mix.sh:
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export PYTHONHASHSEED=0
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 32768 \
--block-size 128 \
--max-num-batched-tokens 16384 \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_load_failure_policy": "recompute",
"kv_connector_extra_config": {
"lookup_rpc_port":"1",
"backend": "mooncake"
}
}' > mix.log 2>&1
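2. Run inference#
Set localhost, the port, and the model weight path in the commands to match your own setup. Requests go only to the port served by the mixed deployment script, so no separate proxy is needed.
Short question: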
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
Long question:
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
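Note: For MooncakeStore with ASCEND_BUFFER_POOL enabled, run a warm-up phase before actual performance benchmarking.
The reason is that, when device-to-device communication is involved, HCCL one-sided communication connections are created lazily after the instance starts. Currently, a full mesh of connections must be established among all devices. Establishing these connections adds a one-time latency overhead and a persistent device memory cost (4 MB of device memory per connection).
For warm-up, send requests with an input sequence length of 8K and an output sequence length of 1. The total number of requests should be 2 to 3 times the number of devices (cards/dies).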
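The warm-up sizing can be sketched as a small planner. The helper names are hypothetical; treating 8K as 8192 tokens and estimating mesh memory as one connection to every other device are both assumptions, not statements from Mooncake or HCCL.

```python
def warmup_plan(num_devices, factor=3, input_tokens=8192, output_tokens=1):
    """Return one request descriptor per warm-up request (factor must be 2 or 3)."""
    if factor not in (2, 3):
        raise ValueError("the guide recommends 2 to 3 requests per device")
    total = num_devices * factor
    return [
        {"input_tokens": input_tokens, "max_tokens": output_tokens}
        for _ in range(total)
    ]


def mesh_memory_mb_per_device(num_devices, mb_per_connection=4):
    """Rough estimate, assuming each device connects to every other device once."""
    return (num_devices - 1) * mb_per_connection


plan = warmup_plan(num_devices=8)
print(len(plan), mesh_memory_mb_per_device(8))
```

Each descriptor can then be turned into a completion request with a prompt of the given length.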
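Enabling MooncakeStore SSD Offload in Embedded Real Client Mode#
This feature requires mooncake >= v0.3.11.
Start mooncake_master#
Start the Mooncake master as described in Running Mooncake Master. To enable SSD offload, add --enable_offload=true to the same master launch command. For example: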
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1 --default_kv_lease_ttl 11000 --enable_offload=true --client_ttl=120
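| Field | Description |
|---|---|
| --enable_offload | Set to true to enable SSD offload. |
| --client_ttl | Seconds a client stays alive after the last Ping. The CLI default is 10; the command above uses 120. |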
Configuration#
Starting from the mooncake.json configured in Running Mooncake Master, add the following SSD offload fields:
{
"enable_ssd_offload": true,
"ssd_offload_path": "/nvme/mooncake_offload"
}
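| Field | Description |
|---|---|
| enable_ssd_offload | Set to true to enable SSD offload. |
| ssd_offload_path | When enable_ssd_offload is true, the directory that holds the offloaded data. |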
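Running the embedded real client#
In mode A (embedded real client), Mooncake is embedded in vLLM. When the vLLM service starts, AscendStoreConnector / MooncakeBackend automatically calls MooncakeDistributedStore.setup() with the settings from mooncake.json (including enable_ssd_offload and ssd_offload_path when SSD offload is enabled). No separate mooncake_client process is needed.
SSD disk usage control#
The following environment variables control disk space usage for SSD offload (bucket backend):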
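| Environment variable | Default | Description |
|---|---|---|
| MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES | 1280 MB | Per-rank SSD read/write buffer size in bytes. |
| MOONCAKE_OFFLOAD_BUCKET_MAX_TOTAL_SIZE | | Eviction threshold in bytes. |
| MOONCAKE_OFFLOAD_BUCKET_EVICTION_POLICY | | Eviction policy, for example lru. |
| MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES | 2 TB | Per-rank maximum disk usage reported to Mooncake master. Master aggregates this across clients (roughly 2 TB × rank count with the default). |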
MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES risk: If left at the 2 TB default, master shows a total SSD quota far larger than the physical disk (e.g. 16 ranks → ~32 TB displayed on a 1 TB NVMe). Offload still fails when the disk fills, while monitoring looks healthy. Set this to your actual per-rank budget before production use.
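Each TP rank uses its own SSD subdirectory under ssd_offload_path (rank_0/, rank_1/, ...), so all ranks share the same physical disk. To prevent a single rank from consuming too much space, set an explicit per-rank quota. For example, for an 800 GB disk and 8 TP ranks: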
# 800 GB total disk, 8 ranks, ~100 GB per rank
export MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES=$((100 * 1024 * 1024 * 1024))
export MOONCAKE_OFFLOAD_BUCKET_MAX_TOTAL_SIZE=$((100 * 1024 * 1024 * 1024))
export MOONCAKE_OFFLOAD_BUCKET_EVICTION_POLICY=lru
export MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES=1073741824 # 1 GB
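The quota arithmetic can be sketched as follows. The helper names are hypothetical, the split is an even division of the disk across ranks, and binary units (1 GB = 1024³ bytes) are assumed to match the shell arithmetic above.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def per_rank_quota_bytes(disk_bytes, ranks):
    """Split the disk evenly across ranks using integer arithmetic."""
    return disk_bytes // ranks


def offload_env(disk_gib, ranks, policy="lru"):
    """Return the environment variables to export for a shared offload disk."""
    quota = per_rank_quota_bytes(disk_gib * GIB, ranks)
    return {
        "MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES": str(quota),
        "MOONCAKE_OFFLOAD_BUCKET_MAX_TOTAL_SIZE": str(quota),
        "MOONCAKE_OFFLOAD_BUCKET_EVICTION_POLICY": policy,
    }


def quota_reported_to_master(per_rank_limit_bytes, ranks):
    """Master sums the per-rank limits, so the displayed total scales with rank count."""
    return per_rank_limit_bytes * ranks


env = offload_env(disk_gib=800, ranks=8)
for name, value in env.items():
    print(f"export {name}={value}")
# With the 2 TB default and 16 ranks, master would display about 32 TB.
print(quota_reported_to_master(2 * TIB, 16) // TIB)
```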
Example: Using Memcache as the KV Pool Backend#
Install Memcache#
MemCache depends on MemFabric, so install MemFabric first and then install MemCache.
pip install memfabric-hybrid
pip install memcache-hybrid
Configure the Memcache configuration files#
mmc-meta.conf:
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
ock.mmc.meta_service.config_store_url = tcp://xx.xx.xx.xx:6000
ock.mmc.log_level = error
mmc-local.conf:
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
ock.mmc.local_service.config_store_url = tcp://xx.xx.xx.xx:6000
ock.mmc.log_level = error
ock.mmc.local_service.world_size = 256
ock.mmc.local_service.protocol = device_sdma
ock.mmc.local_service.dram.size = 1GB
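Key points:
| Parameter | Description |
|---|---|
| ock.mmc.meta_service_url | IP address and port of the master node. Prefill (P) and decode (D) nodes can use the same IP address and port. |
| ock.mmc.meta_service.config_store_url, ock.mmc.local_service.config_store_url | IP address and port of the master node. Prefill (P) and decode (D) nodes can use the same IP address and port. |
| ock.mmc.local_service.world_size | Total number of local services, including services that will be added in the future. |
| ock.mmc.local_service.protocol | |
| ock.mmc.local_service.dram.size | Memory size occupied by the master node. The configured value is the memory size per card. |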
Run Memcache Master#
Start the MetaService service.
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf
python -c "from memcache_hybrid import MetaService; MetaService.main()"
PD Disaggregation (Prefill-Decode Disaggregation) Scenario#
1. Run the prefill node and the decode node#
Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 performs kv_transfer, while AscendStoreConnector enables the KV cache pool.
800I A2/800T A2/800I A3/800T A3 series#
run_prefill.sh/run_decode.sh:
#!/bin/bash
ROLE="prefill" # prefill / decode
HARDWARE_SERIES="A2" # A2 (800I/800T A2) or A3 (800I/800T A3)
LOCAL_IP="xx.xx.xx.xx"
NIC_NAME="xxxxxx"
MODEL_PATH="xxxxxxx/Qwen3-32B"
SERVED_MODEL_NAME="qwen3"
DATA_PARALLEL_SIZE=1
TENSOR_PARALLEL_SIZE=8
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
if [ "$ROLE" == "prefill" ]; then
KV_ROLE="kv_producer"
KV_PORT="20001"
LOOKUP_RPC_PORT="0"
else
KV_ROLE="kv_consumer"
KV_PORT="20002"
LOOKUP_RPC_PORT="1"
fi
echo "Starting vLLM on Series: $HARDWARE_SERIES, Role: $ROLE"
rm -rf /root/ascend/log/*
rm -rf ./connector.log
if [ "$HARDWARE_SERIES" == "A2" ]; then
echo 200000 > /proc/sys/vm/nr_hugepages
export HCCL_IF_IP=$LOCAL_IP
export GLOO_SOCKET_IFNAME=$NIC_NAME
export TP_SOCKET_IFNAME=$NIC_NAME
export HCCL_SOCKET_IFNAME=$NIC_NAME
elif [ "$HARDWARE_SERIES" == "A3" ]; then
export ACL_OP_INIT_MODE=1
else
echo "Error: Invalid HARDWARE_SERIES. Set to 'A2' or 'A3'."
exit 1
fi
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
KV_CONFIG='{
"kv_connector": "MultiConnector",
"kv_role": "'$KV_ROLE'",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "'$KV_ROLE'",
"kv_port": "'$KV_PORT'",
"kv_connector_extra_config": {
"prefill": {
"dp_size": '$DATA_PARALLEL_SIZE',
"tp_size": '$TENSOR_PARALLEL_SIZE'
},
"decode": {
"dp_size": '$DATA_PARALLEL_SIZE',
"tp_size": '$TENSOR_PARALLEL_SIZE'
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "'$KV_ROLE'",
"kv_connector_extra_config": {
"backend": "memcache",
"lookup_rpc_port": "'$LOOKUP_RPC_PORT'"
}
}
]
}
}'
CMD_ARGS=(
--model "$MODEL_PATH"
--served-model-name "$SERVED_MODEL_NAME"
--trust-remote-code
--enforce-eager
--data-parallel-size "$DATA_PARALLEL_SIZE"
--tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
--port 30050
--max-num-seqs 20
--max-model-len 32768
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.9
--kv-transfer-config "$KV_CONFIG"
)
python -m vllm.entrypoints.openai.api_server "${CMD_ARGS[@]}" > log_${ROLE}.log 2>&1
echo "vLLM started. Log file: log_${ROLE}.log"
2. Start the proxy server#
Refer to Start proxy_server in the MooncakeStore deployment section.
3. Run Inference#
Refer to Run Inference in the MooncakeStore deployment section.
PD Mixed Deployment Scenario#
1. Run the mixed deployment script#
800I A2/800T A2/800I A3/800T A3 series#
Run_pd_mix.sh:
#!/bin/bash
HARDWARE_SERIES="A2" # A2 (800I/800T A2) or A3 (800I/800T A3)
LOCAL_IP="xx.xx.xx.xx"
NIC_NAME="xxxxxx"
MODEL_PATH="xxxxxxx/Qwen3-32B"
SERVED_MODEL_NAME="qwen3"
DATA_PARALLEL_SIZE=1
TENSOR_PARALLEL_SIZE=8
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf
echo "Starting vLLM on Series: $HARDWARE_SERIES"
rm -rf /root/ascend/log/*
rm -rf ./connector.log
if [ "$HARDWARE_SERIES" == "A2" ]; then
echo 200000 > /proc/sys/vm/nr_hugepages
export HCCL_IF_IP=$LOCAL_IP
export GLOO_SOCKET_IFNAME=$NIC_NAME
export TP_SOCKET_IFNAME=$NIC_NAME
export HCCL_SOCKET_IFNAME=$NIC_NAME
elif [ "$HARDWARE_SERIES" == "A3" ]; then
export ACL_OP_INIT_MODE=1
else
echo "Error: Invalid HARDWARE_SERIES. Set to 'A2' or 'A3'."
exit 1
fi
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
KV_CONFIG='{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"backend": "memcache",
"lookup_rpc_port": "0"
}
}'
CMD_ARGS=(
--model "$MODEL_PATH"
--served-model-name "$SERVED_MODEL_NAME"
--trust-remote-code
--enforce-eager
--data-parallel-size "$DATA_PARALLEL_SIZE"
--tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
--port 30050
--max-num-seqs 20
--max-model-len 32768
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.9
--kv-transfer-config "$KV_CONFIG"
)
python -m vllm.entrypoints.openai.api_server "${CMD_ARGS[@]}" > log_mix.log 2>&1
echo "vLLM started. Log file: log_mix.log"
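2. Run inference#
Example: Using Yuanrong as the KV Pool Backend#
Software requirements:
Install openyuanrong-datasystem on all nodes (yr.datasystem must be importable).
Install the Yuanrong data system#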
pip install openyuanrong-datasystem
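If the prebuilt package does not match the CANN or Ascend driver version in your environment, build the Yuanrong data system from source inside the vLLM Ascend image. Follow the official Yuanrong data system build instructions: https://atomgit.com/openeuler/yuanrong-datasystem
Start etcd#
The Yuanrong data system uses etcd for service discovery. The following example starts a single-node etcd cluster: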
ETCD_VERSION="v3.5.12"
ETCD_IP="127.0.0.1"
if [ "$(uname -m)" = "aarch64" ]; then
ETCD_ARCH="linux-arm64"
else
ETCD_ARCH="linux-amd64"
fi
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-${ETCD_ARCH}.tar.gz
tar -xvf etcd-${ETCD_VERSION}-${ETCD_ARCH}.tar.gz
cd etcd-${ETCD_VERSION}-${ETCD_ARCH}
sudo cp etcd etcdctl /usr/local/bin/
etcd \
--name etcd-single \
--data-dir /tmp/etcd-data \
--listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://${ETCD_IP}:2379 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-advertise-peer-urls http://${ETCD_IP}:2380 \
--initial-cluster etcd-single=http://${ETCD_IP}:2380 &
etcdctl --endpoints "${ETCD_IP}:2379" put key "value"
etcdctl --endpoints "${ETCD_IP}:2379" get key
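For production environments, refer to the official etcd clustering documentation: https://etcd.io/docs/v3.7/op-guide/clustering/
Start the data system worker#
Use dscli to start one data system worker on each node: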
dscli start -w \
--worker_address "${WORKER_IP}:31501" \
--etcd_address "${ETCD_IP}:2379" \
--shared_memory_size_mb 40960 \
--enable_worker_worker_batch_get=true
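The value of --worker_address is used later by DS_WORKER_ADDR, so keep the host and port consistent on the same node.
For more parameters, refer to the dscli usage documentation on the official Yuanrong data system website: https://atomgit.com/openeuler/yuanrong-datasystem
To stop the worker: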
dscli stop --worker_address "${WORKER_IP}:31501"
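Environment variable configuration#
Before starting vLLM, set the following environment variables on each node:
| Variable | Required | Default | Description |
|---|---|---|---|
| PYTHONHASHSEED | Yes | | Must be identical on all nodes so that hash generation is consistent. |
| DS_WORKER_ADDR | Yes | N/A | Data system worker address, in host:port format. |
| DS_ENABLE_EXCLUSIVE_CONNECTION | No | | Passed to Yuanrong. |
| DS_ENABLE_REMOTE_H2D | No | | Passed to Yuanrong. |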
export PYTHONHASHSEED=0
export DS_WORKER_ADDR="${WORKER_IP}:31501"
export DS_ENABLE_EXCLUSIVE_CONNECTION=0
export DS_ENABLE_REMOTE_H2D=0
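Remote H2D requirements#
Set DS_ENABLE_REMOTE_H2D=1 only when remote host-to-device transfer has been enabled and verified in the Yuanrong data system deployment:
Reserve enough 2 MiB HugeTLB pages before starting the worker. For 40 GiB of shared memory, reserve at least 20480 huge pages of 2 MiB each.
Start each data system worker with remote H2D enabled. The worker launch command must include --remote_h2d_device_ids, --enable_huge_tlb true, --arena_per_tenant 1, and --enable_fallocate false. Using multiple available NPU device IDs is recommended, for example "0,1,2,3,4,5,6,7" on an 8-NPU node.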
dscli start -w \
--worker_address "${WORKER_IP}:31501" \
--etcd_address "${ETCD_IP}:2379" \
--shared_memory_size_mb 40960 \
--arena_per_tenant 1 \
--enable_huge_tlb true \
--enable_fallocate false \
--remote_h2d_device_ids "0,1,2,3,4,5,6,7" \
--enable_worker_worker_batch_get=true
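Make sure the NPU driver, firmware, and CANN toolkit required by Yuanrong remote H2D are installed and visible to the worker process. In a container, mount the Ascend driver path, npu-smi, hccn_tool, /etc/hccn.conf, /etc/ascend_install.info, and the required /dev/davinci* devices.
Verify the NPU and RoCE environment before enabling the client flag: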
# Check the current 2 MiB HugeTLB page size, total count, and free count.
grep -E "HugePages_Total|HugePages_Free|Hugepagesize" /proc/meminfo
# Optional: check 2 MiB HugeTLB pages on each NUMA node.
for node in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
echo "$node total=$(cat "$node/nr_hugepages") free=$(cat "$node/free_hugepages")"
done
# Check that NPU devices and the driver are visible to the worker environment.
npu-smi info
# Check that the NPU topology is visible.
npu-smi info -t topo
# Check optical module detection on the selected local NPU.
hccn_tool -i <local_npu_id> -optical -g
# Check RoCE physical link status. The expected link status is UP.
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check the selected NPU IP address and reachability to the remote NPU.
hccn_tool -i <local_npu_id> -ip -g
hccn_tool -i <local_npu_id> -ping -g address <remote_npu_ip>
If these checks fail, keep DS_ENABLE_REMOTE_H2D=0 and use the default data system transfer path.
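The huge page requirement follows directly from the shared memory size: 40960 MB divided by 2 MiB pages gives 20480 pages. A minimal sketch of that check is shown below; the helper names are hypothetical and the sample text only imitates the relevant /proc/meminfo fields.

```python
def required_hugepages(shared_memory_size_mb, hugepage_mb=2):
    """Number of huge pages needed to back the shared memory (rounded up)."""
    return -(-shared_memory_size_mb // hugepage_mb)


def parse_meminfo(text):
    """Extract HugePages_Total and HugePages_Free from /proc/meminfo text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("HugePages_Total", "HugePages_Free"):
            fields[key] = int(rest.split()[0])
    return fields


def enough_free_hugepages(meminfo_text, shared_memory_size_mb):
    """True when the free huge pages cover --shared_memory_size_mb."""
    free = parse_meminfo(meminfo_text).get("HugePages_Free", 0)
    return free >= required_hugepages(shared_memory_size_mb)


sample = "HugePages_Total:   20480\nHugePages_Free:    20480\nHugepagesize:       2048 kB\n"
print(required_hugepages(40960), enough_free_hugepages(sample, 40960))
```

On a real node, pass the contents of /proc/meminfo instead of the sample string.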
Run AscendStoreConnector with the Yuanrong backend#
Use AscendStoreConnector and set backend: "yuanrong":
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_load_failure_policy": "recompute",
"kv_connector_extra_config": {
"lookup_rpc_port": "1",
"backend": "yuanrong"
}
}'
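lookup_rpc_port is the RPC port used between the pooling scheduler process and the worker processes. Each instance must use a unique port value.
Notes#
The Yuanrong backend normalizes KV keys before calling the data system. Keys that are longer than 255 characters or contain unsupported characters are rewritten, so do not rely on the original key strings when debugging backend storage.
Yuanrong does not need an extra buffer pre-registration step. The backend uses device pointers directly when building the blob list.
2. Run inference#
FAQ#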
1. Mooncake FAQ#
1.1 failed to put/get key#
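When vLLM reports a failed put or get operation, first check whether the error is reported by Mooncake itself.
If the error is reported by Mooncake:
For put failures, check whether the Mooncake log contains NO_AVAILABLE_HANDLE or BatchPut failed ... due to insufficient space. This usually means that the space left after eviction is not enough to hold one BatchPut request. Make sure the space left by the eviction policy (for example, the capacity implied by 1 - eviction_ratio) can hold one batch put, or consider increasing the available capacity, increasing the eviction headroom, or reducing the batch size.
For get failures, check whether the Mooncake log contains lease_expired_before_data_transfer_completed key=... or returns LEASE_EXPIRED. This means the KV object lease expired before the data transfer completed. Increase --default_kv_lease_ttl of mooncake_master as needed, and keep it larger than ASCEND_CONNECT_TIMEOUT and ASCEND_TRANSFER_TIMEOUT.
If the error is not reported by Mooncake, it is most likely an HIXL (ascend_direct) transport-layer problem. Collect the plog files under /root/ascend/log/debug/plog and check whether the problem matches a known HIXL issue.
For common troubleshooting and problem-location guidance on HIXL (ascend_direct), see: https://gitcode.com/cann/hixl/wiki/HIXL常见问题定位手册.md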
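The triage above can be partly automated by scanning logs for the markers it names. This is a sketch only: classify_log_line and summarize are hypothetical helpers, the marker strings are the ones quoted in this section, and the sample lines are made up for illustration rather than copied from real Mooncake output.

```python
from collections import Counter

# Marker substrings quoted in the FAQ above, grouped by the failure they indicate.
MOONCAKE_MARKERS = {
    "put_insufficient_space": ("NO_AVAILABLE_HANDLE", "due to insufficient space"),
    "get_lease_expired": ("lease_expired_before_data_transfer_completed", "LEASE_EXPIRED"),
}


def classify_log_line(line):
    """Return the failure label for a log line, or None if no marker matches."""
    for label, needles in MOONCAKE_MARKERS.items():
        if any(needle in line for needle in needles):
            return label
    return None


def summarize(lines):
    """Count how many lines fall into each known failure class."""
    return Counter(label for label in map(classify_log_line, lines) if label)


sample_lines = [
    "E example: BatchPut failed for 8 keys due to insufficient space",  # made-up line
    "E example: get returned LEASE_EXPIRED",                            # made-up line
    "I example: unrelated message",
]
print(summarize(sample_lines))
```

Lines that match neither class point toward the HIXL (ascend_direct) path described above.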
1.2 SSD FAQ#
1.2.1 SEGMENT_NOT_FOUND with SSD offload#
If client logs show OffloadObjectHeartbeat failed, error code is SEGMENT_NOT_FOUND, Master has unmounted the rank's LOCAL_DISK segment (usually after client_expired when Ping stops refreshing TTL). SSD offload on that rank stops until the segment is registered again.
Typical trigger (with enable_cpu_binding=true): Mooncake starts Ping during init, then vLLM-Ascend bind_cpus() runs migratepages/IRQ binding; the Ping thread is not pinned and can miss beats under the default client_ttl=10.
| Mitigation | Notes |
|---|---|
| Temporary: raise Master TTL | e.g. --client_ttl=120 |
| Recovery: upgrade Mooncake | Versions > v0.3.11 (main branch) can remount the unmounted segment. |
| Root fix: Mooncake Ping CPU affinity | Pin the storage Ping thread to a release/isolated CPU (Mooncake-side change). Optional vLLM-Ascend cooperation to pass the release CPU per rank. |
Also restart Master together with vLLM to avoid stale segment_already_exists state when debugging restarts.
1.2.2 Fabric memory size alignment (A3 + ASCEND_ENABLE_USE_FABRIC_MEM=1)#
On A3 with fabric memory enabled, each fabric mem allocation must be an integer multiple of 1 GB (1073741824 bytes). Mooncake does not round sizes up automatically.
| Parameter | Config source | Alignment |
|---|---|---|
| global_segment_size | mooncake.json | Each rank's segment size must be aligned to 1 GB. |
| MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES | export (environment variable) | Must be aligned to 1 GB. The default is 1280 MB (1.25 GB), which is not aligned and is too small for long-context SSD loads; size it as described in Sizing MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES. |
local_buffer_size in mooncake.json is not used under fabric mem (vLLM-Ascend passes 0 to setup()).
Risk if misaligned: adxl MallocMem / aclrtMapMem fails with Invalid_Argument. With SSD offload enabled, a failed MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES allocation can segfault during FileStorage init and abort vLLM startup. Avoid values such as "1280MB", "512MB", or "1.5GB".
Fabric mem quota: Both global_segment_size and MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES are separate fabric mem allocations per rank. Their sizes add up against the HIXL fabric mem limit configured via ASCEND_GLOBAL_RESOURCE_CONFIG (e.g. "fabric_memory.max_capacity":32, unit GB per process — see HIXL docs). Rough budget per rank:
fabric_memory.max_capacity ≥ global_segment_size + MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES (+ headroom)
Risk if quota is too low: Some ranks fail with Memory_Allocation_Failure(EL0004) after global_segment_size succeeds but MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES allocation fails. Increase fabric_memory.max_capacity, reduce global_segment_size or MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES, or ensure the node has enough host memory.
Example (add to your vLLM startup script when SSD offload is on):
export ASCEND_ENABLE_USE_FABRIC_MEM=1
export MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES=1073741824 # 1 GB, fabric-mem aligned
# Set ASCEND_GLOBAL_RESOURCE_CONFIG only if the fabric mem quota is too low.
# Per-rank fabric mem budget: 20 GB segment + 1 GB MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES → set max_capacity ≥ 22 (GB)
export ASCEND_GLOBAL_RESOURCE_CONFIG='{"fabric_memory.max_capacity":32}'
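The alignment and quota rules can be checked before startup. The sketch below uses hypothetical helper names, and the 1 GB headroom is an assumption chosen to reproduce the "20 GB + 1 GB → max_capacity ≥ 22" figure in the example above.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def is_fabric_aligned(size_bytes):
    """Fabric mem allocations must be a positive integer multiple of 1 GB."""
    return size_bytes > 0 and size_bytes % GIB == 0


def required_capacity_gb(segment_bytes, offload_buffer_bytes, headroom_gb=1):
    """Per-rank fabric mem budget in GB: segment + offload buffer + headroom."""
    for size in (segment_bytes, offload_buffer_bytes):
        if not is_fabric_aligned(size):
            raise ValueError(f"{size} bytes is not aligned to 1 GB")
    return (segment_bytes + offload_buffer_bytes) // GIB + headroom_gb


def fabric_budget_ok(segment_bytes, offload_buffer_bytes, max_capacity_gb):
    """True when fabric_memory.max_capacity covers both allocations plus headroom."""
    return required_capacity_gb(segment_bytes, offload_buffer_bytes) <= max_capacity_gb


print(is_fabric_aligned(1280 * MIB))               # the 1280 MB default is not aligned
print(required_capacity_gb(20 * GIB, 1 * GIB))     # 20 GB segment + 1 GB buffer + headroom
print(fabric_budget_ok(20 * GIB, 1 * GIB, 32))
```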
1.2.3 Sizing MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES#
When enable_ssd_offload=true, Mooncake allocates a separate per-rank SSD read/write buffer sized by MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES. This buffer is independent of global_segment_size in mooncake.json — increasing the segment does not fix BUFFER_OVERFLOW caused by an undersized MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES.
If the buffer is too small, SSD reads fail with BUFFER_OVERFLOW (error_code=-10) during FileStorage::AllocateBatch, and vLLM may fail when kv_load_failure_policy=fail.
If you encounter BUFFER_OVERFLOW during use, try increasing MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES. Do not set it higher than the Available KV cache memory value shown in vLLM worker logs:
(Worker_TP0_EP0 pid=21240) INFO 06-23 17:41:09 [worker.py:552] Available KV cache memory: XX
Example:
export MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES=10737418240 # 10 GB
Use byte literals only (10737418240). 10G / 10GB are ignored and fall back to the 1280 MB default.
Notes
--max-num-batched-tokens only chunks prefill compute; it does not reduce the memory required by MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES.
Host memory budget (single node)#
MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES is allocated per rank, in addition to global_segment_size:
host_memory_for_mooncake ≈ TP × (global_segment_size + MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES + local_buffer_size)
Ensure free -h available on the host exceeds this sum plus vLLM overhead. MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES does not need to fit inside global_segment_size.
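The host memory formula can be evaluated directly. This is a sketch with hypothetical helper names; the vLLM overhead value is a placeholder that must be measured on your own system.

```python
GIB = 1024 ** 3


def host_memory_for_mooncake(tp, global_segment_size, offload_buffer_size,
                             local_buffer_size=0):
    """TP x (global_segment_size + offload buffer + local_buffer_size), in bytes."""
    return tp * (global_segment_size + offload_buffer_size + local_buffer_size)


def fits_on_host(available_bytes, mooncake_bytes, vllm_overhead_bytes):
    """Compare the 'available' column of free -h with the Mooncake budget plus overhead."""
    return available_bytes > mooncake_bytes + vllm_overhead_bytes


budget = host_memory_for_mooncake(tp=8, global_segment_size=20 * GIB,
                                  offload_buffer_size=10 * GIB)
print(budget // GIB)
# The 64 GiB overhead below is a placeholder value, not a measured figure.
print(fits_on_host(512 * GIB, budget, vllm_overhead_bytes=64 * GIB))
```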
Verify after tuning#
Startup: each rank logs AlignedClientBufferAllocator: allocated <N> bytes with your configured size.
Under load: no BUFFER_OVERFLOW / Failed to get ... keys out of ... error_codes=[-10].
If failures persist with a large buffer, check overlapping loads (load_async).
2. Memcache FAQ#
For Memcache troubleshooting, see: https://gitcode.com/Ascend/memcache/wiki/FAQ.md
3. DSv4 known issues (temporary)#
For temporary DSv4 known issues, see: vllm-project/vllm-ascend#9975