Ascend Store Deployment#
Environment Dependencies#
Software requirements:
Python >= 3.10, < 3.12
CANN == 8.3.rc2
PyTorch == 2.8.0, torch-npu == 2.8.0
vLLM: main branch
vLLM-Ascend: main branch
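KV Pool Parameters#
kv_connector_extra_config: additional configurable parameters for the KV pool.
lookup_rpc_port: the port used for RPC communication between the pooling scheduler process and the Worker processes. Each instance needs its own unique port.
load_async: whether to enable asynchronous loading. The default is false.
backend: the storage backend of the KV pool. The default is mooncake.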
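As a rough illustration of how these parameters fit together, the following Python sketch builds the JSON passed to --kv-transfer-config for several AscendStoreConnector instances and checks that each one gets its own lookup_rpc_port. The helper name build_store_config and the port-numbering scheme are assumptions made for this example; only the configuration keys come from this guide.

```python
import json


def build_store_config(lookup_rpc_port, load_async=False, backend="mooncake"):
    """Return a kv-transfer-config dict for one AscendStoreConnector instance.

    Hypothetical helper: the keys mirror the parameters described above,
    but the function itself is not part of vLLM or vLLM-Ascend.
    """
    return {
        "kv_connector": "AscendStoreConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            # The examples in this guide pass the port as a string.
            "lookup_rpc_port": str(lookup_rpc_port),
            "load_async": load_async,
            "backend": backend,
        },
    }


# Assumed scheme: instance i uses lookup_rpc_port i, so no two instances collide.
configs = [build_store_config(port) for port in range(3)]
ports = [c["kv_connector_extra_config"]["lookup_rpc_port"] for c in configs]
assert len(set(ports)) == len(ports), "each instance needs a unique lookup_rpc_port"

# Serialize one config into the string passed to --kv-transfer-config.
print(json.dumps(configs[0], indent=2))
```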
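Environment Variable Configuration#
To keep the generated hash values consistent, the PYTHONHASHSEED environment variable must be set to the same value on every node when the KV pool is enabled.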
export PYTHONHASHSEED=0
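The effect of a fixed seed can be observed with plain Python. The sketch below starts two fresh interpreters with the same PYTHONHASHSEED and compares the hash of one string. It only illustrates general Python behavior, not the KV pool's own key-generation code, and the helper name is an assumption.

```python
import os
import subprocess
import sys


def string_hash_in_new_process(text, seed):
    """Start a fresh interpreter with PYTHONHASHSEED=seed and return hash(text)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# Two independent processes that share the seed agree on the hash value.
first = string_hash_in_new_process("kv-block", seed=0)
second = string_hash_in_new_process("kv-block", seed=0)
print(first == second)
```

Without a fixed seed, Python randomizes string hashes per process, so values computed on different nodes would generally not match.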
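Example: Using Mooncake as the KV Pool Backend#
Prerequisites:
Check the NPU HCCN configuration:
Ensure that the hccn.conf file exists in the environment. If you use Docker, mount it into the container.
cat /etc/hccn.conf
Install Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and build guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries.
First, obtain the Mooncake project with the following command: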
git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
(Optional) If your network connection is poor, switch the Go download URL used by dependencies.sh to a mirror
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
Install MPI
apt-get install mpich libmpich-dev -y
Install the dependencies. Go does not need to be installed separately.
bash dependencies.sh -y
Build and install
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
Set environment variables
Note:
Adjust the Python path to match your own Python installation.
Ensure that /usr/local/lib and /usr/local/lib64 are included in your LD_LIBRARY_PATH.
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
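A quick way to confirm the second note is to inspect the variable from Python. This is a minimal sketch; the helper name missing_library_dirs is hypothetical, and it only checks the two directories named above.

```python
import os


def missing_library_dirs(ld_library_path,
                         required=("/usr/local/lib", "/usr/local/lib64")):
    """Return the required directories that are absent from an LD_LIBRARY_PATH string."""
    # Normalize entries so that a trailing slash does not cause a false miss.
    entries = {os.path.normpath(p) for p in ld_library_path.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in entries]


# An empty list means both directories are already on the search path.
print(missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", "")))
```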
Run Mooncake Master#
1. Configure mooncake.json#
Set the MOONCAKE_CONFIG_PATH environment variable to the full path of mooncake.json.
{
"metadata_server": "P2PHANDSHAKE",
"protocol": "ascend",
"device_name": "",
"master_server_address": "xx.xx.xx.xx:50088",
"global_segment_size": "1GB"
}
metadata_server: set to P2PHANDSHAKE.
protocol: must be set to ascend on NPU, as in the example above.
device_name: leave empty ("").
master_server_address: the IP address and port of the Master service.
global_segment_size: the amount of memory that each card registers to the KV pool. The value 1GB can equivalently be written as 1024MB, 1048576KB, 1073741824B, or 1073741824.
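The sketch below writes this configuration to a file from Python and checks the unit arithmetic behind the equivalent global_segment_size spellings. The size_to_bytes helper is illustrative only and is not Mooncake's own parser; the temporary output directory is likewise an assumption, and the master address is the placeholder from the example.

```python
import json
import os
import tempfile

# Multipliers for the size suffixes used in this guide (binary units).
_UNITS = {"GB": 1024 ** 3, "MB": 1024 ** 2, "KB": 1024, "B": 1}


def size_to_bytes(value):
    """Convert '1GB', '1024MB', '1048576KB', '1073741824B' or a bare number to bytes."""
    text = str(value).strip().upper()
    for suffix in ("GB", "MB", "KB", "B"):  # two-letter suffixes before bare 'B'
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * _UNITS[suffix]
    return int(text)


config = {
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "master_server_address": "xx.xx.xx.xx:50088",  # placeholder, replace with your Master
    "global_segment_size": "1GB",
}

# Write the file to a temporary directory (the location is an assumption).
config_path = os.path.join(tempfile.mkdtemp(), "mooncake.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

print(f"export MOONCAKE_CONFIG_PATH={config_path}")
print(size_to_bytes(config["global_segment_size"]))
```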
2. Start mooncake_master#
From the Mooncake directory:
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1
eviction_high_watermark_ratio sets the usage watermark at which Mooncake Store starts evicting, and eviction_ratio sets the proportion of stored objects that is evicted.
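To make the interaction of the two flags concrete, here is a toy model in Python. It is only a sketch under stated assumptions: that eviction starts once usage reaches the watermark, and that one pass removes eviction_ratio of the stored objects, rounded up. Mooncake's actual trigger condition and victim selection are not described in this guide, and the function name is hypothetical.

```python
import math


def objects_to_evict(used_bytes, capacity_bytes, object_count,
                     high_watermark_ratio=0.9, eviction_ratio=0.1):
    """Toy model: how many stored objects one eviction pass would remove."""
    # Assumption: nothing is evicted while usage stays below the watermark.
    if used_bytes / capacity_bytes < high_watermark_ratio:
        return 0
    # Assumption: a pass removes eviction_ratio of the objects, rounded up.
    return math.ceil(object_count * eviction_ratio)


# 80% usage is below the 0.9 watermark; 95% usage is above it.
print(objects_to_evict(used_bytes=80, capacity_bytes=100, object_count=50))
print(objects_to_evict(used_bytes=95, capacity_bytes=100, object_count=50))
```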
PD Disaggregation (Prefill-Decode Disaggregation) Scenario#
1. Run the prefill node and the decode node#
Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 handles kv_transfer (KV transfer), while AscendStoreConnector serves as the prefix-cache node.
Prefill node:
bash multi_producer.sh
Contents of multi_producer.sh:
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
# ASCEND_BUFFER_POOL configures the number and size of the buffers on the NPU device used for aggregation and KV transfer. The value 4:8 allocates 4 buffers of 8 MB each.
export ASCEND_BUFFER_POOL=4:8
# Unit: ms. The timeout for establishing a one-sided communication connection defaults to 10 seconds (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Adjust this value to suit your setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
# This ensures that the connection does not time out even in the worst case, where all Decode cards try to connect to the same Prefill card at the same time.
export ASCEND_CONNECT_TIMEOUT=10000
# Unit: ms. The timeout for one-sided communication transfer is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039).
export ASCEND_TRANSFER_TIMEOUT=10000
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no_enable_prefix_caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "20001",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"lookup_rpc_port":"0",
"backend": "mooncake"
}
}
]
}
}'
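The timeout formula given in the script comments can be worked through in a few lines of Python. The function name is hypothetical, and 500 ms is the upper bound on the per-card connection time quoted in those comments.

```python
def recommended_connect_timeout_ms(total_decode_cards, connection_time_per_card_ms=500):
    """ASCEND_CONNECT_TIMEOUT = connection_time_per_card x total_number_of_Decode_cards."""
    return connection_time_per_card_ms * total_decode_cards


for cards in (4, 16, 64):
    timeout_ms = recommended_connect_timeout_ms(cards)
    print(f"{cards} decode cards -> export ASCEND_CONNECT_TIMEOUT={timeout_ms}")
```

By this formula, the default of 10000 ms covers up to 20 Decode cards at 500 ms each; larger deployments would need a higher value.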
Decode node:
bash multi_consumer.sh
Contents of multi_consumer.sh:
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export PYTHONHASHSEED=0
export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8200 \
--trust-remote-code \
--enforce-eager \
--no_enable_prefix_caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "20002",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
},
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"lookup_rpc_port":"0",
"backend": "mooncake"
}
}
]
}
}'
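The --kv-transfer-config values in the two scripts differ only in kv_role, kv_port, and the use_ascend_direct flag, which appears only on the prefill side. The Python sketch below generates both from one function, which can help keep the two in sync. build_multi_connector_config is a hypothetical helper, not part of vLLM or vLLM-Ascend, and the dp_size and tp_size values are fixed at 1 as in the scripts.

```python
import json


def build_multi_connector_config(kv_role, kv_port, use_ascend_direct=None,
                                 lookup_rpc_port="0"):
    """Assemble the MultiConnector JSON used by the prefill and decode scripts."""
    transfer_extra = {
        "prefill": {"dp_size": 1, "tp_size": 1},
        "decode": {"dp_size": 1, "tp_size": 1},
    }
    if use_ascend_direct is not None:
        transfer_extra["use_ascend_direct"] = use_ascend_direct
    return {
        "kv_connector": "MultiConnector",
        "kv_role": kv_role,
        "kv_connector_extra_config": {
            "connectors": [
                {   # KV transfer between the prefill and decode nodes
                    "kv_connector": "MooncakeConnectorV1",
                    "kv_role": kv_role,
                    "kv_port": kv_port,
                    "kv_connector_extra_config": transfer_extra,
                },
                {   # prefix cache backed by the KV pool
                    "kv_connector": "AscendStoreConnector",
                    "kv_role": kv_role,
                    "kv_connector_extra_config": {
                        "lookup_rpc_port": lookup_rpc_port,
                        "backend": "mooncake",
                    },
                },
            ]
        },
    }


producer = build_multi_connector_config("kv_producer", "20001", use_ascend_direct=True)
consumer = build_multi_connector_config("kv_consumer", "20002")

# The printed string can be passed as the --kv-transfer-config argument.
print(json.dumps(consumer))
```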
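Currently, the KV pool in the PD disaggregation scenario stores only the KV Cache generated by the Prefill node by default. For models that use MLA, the Decode node can now also store KV Cache for the Prefill node to use. To enable this, add consumer_is_to_put: true to the AscendStoreConnector configuration. If the Prefill node uses pipeline parallelism (PP), also set prefill_pp_size or prefill_pp_layer_partition. For example: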
{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"lookup_rpc_port":"0",
"backend": "mooncake",
"consumer_is_to_put": true,
"prefill_pp_size": 2,
"prefill_pp_layer_partition": "30,31"
}
}
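The following Python sketch adds these options to an existing AscendStoreConnector extra config. The helper is hypothetical. The check that prefill_pp_layer_partition has one entry per pipeline stage is an assumption inferred from the example ("30,31" with prefill_pp_size 2), not a documented rule.

```python
def add_consumer_put_options(extra_config, prefill_pp_size=None,
                             prefill_pp_layer_partition=None):
    """Return a copy of an AscendStoreConnector extra config with Decode-side puts enabled."""
    updated = dict(extra_config, consumer_is_to_put=True)
    if prefill_pp_size is not None:
        updated["prefill_pp_size"] = prefill_pp_size
    if prefill_pp_layer_partition is not None:
        stages = prefill_pp_layer_partition.split(",")
        # Assumption: one layer count per pipeline stage, as in "30,31" with size 2.
        if prefill_pp_size is not None and len(stages) != prefill_pp_size:
            raise ValueError("partition entries must match prefill_pp_size")
        updated["prefill_pp_layer_partition"] = prefill_pp_layer_partition
    return updated


base = {"lookup_rpc_port": "0", "backend": "mooncake"}
print(add_consumer_put_options(base, prefill_pp_size=2,
                               prefill_pp_layer_partition="30,31"))
```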
2. Start the proxy server (proxy_server)#
python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
--host localhost \
--prefiller-hosts localhost \
--prefiller-ports 8100 \
--decoder-hosts localhost \
--decoder-ports 8200
Replace localhost with your actual IP address.
3. Run inference#
Set localhost, the port, and the model weight path in the commands to match your own setup.
Short question:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
Long question:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
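The same request can be issued from Python with only the standard library. The sketch below builds the request object without sending it, so it can be inspected before the servers are up. The helper name is hypothetical, and the URL and model path are the placeholders from the curl commands.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=200, temperature=0.0):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000",
    "/xxxxx/Qwen2.5-7B-Instruct",
    "Hello. I have a question. The president of the United States is",
)
print(request.full_url)

# To send it once the servers are running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```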
Colocation Scenario#
1. Run the colocation script#
bash mixed_department.sh
Contents of mixed_department.sh:
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export PYTHONHASHSEED=0
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
python3 -m vllm.entrypoints.openai.api_server \
--model /xxxxx/Qwen2.5-7B-Instruct \
--port 8100 \
--trust-remote-code \
--enforce-eager \
--no_enable_prefix_caching \
--tensor-parallel-size 1 \
--data-parallel-size 1 \
--max-model-len 10000 \
--block-size 128 \
--max-num-batched-tokens 4096 \
--kv-transfer-config \
'{
"kv_connector": "AscendStoreConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"lookup_rpc_port":"1",
"backend": "mooncake"
}
}' > mix.log 2>&1
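2. Run inference#
Set localhost, the port, and the model weight path in the commands to match your own setup. Requests are sent directly to the port served by the colocation script, so no separate proxy needs to be started.
Short question: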
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
Long question:
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'