Ascend Store deployment guide

Environment dependencies

  • Software requirements:

    • CANN >= 8.5.0

    • vLLM: main branch

    • vLLM-Ascend: main branch

    • mooncake: >= 0.3.9

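KV pool parameters

kv_load_failure_policy: KV load failure handling policy

kv_load_failure_policy is a top-level field of kv-transfer-config.

  • recompute: when a KV load fails, vLLM rolls the request back to the last valid prefix and reschedules it so that the failed KV blocks are recomputed.

  • fail: when a KV load fails, the affected request is terminated immediately and an error is returned.

The default in vLLM is fail. To have requests fall back to recomputation after a KV load failure, set it to recompute.

When using MultiConnector, configure kv_load_failure_policy in the top-level kv-transfer-config of the MultiConnector, not in the child connectors.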
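As a minimal sketch of this placement rule, the Python snippet below assembles an abbreviated MultiConnector configuration with kv_load_failure_policy on the top-level object only. The helper name build_multi_connector_config is hypothetical, and the child connector entries are shortened; the field and connector names are the ones used in the launch scripts in this guide.

```python
import json


def build_multi_connector_config(kv_role, policy="recompute"):
    """Hypothetical helper: build an abbreviated MultiConnector config dict."""
    if policy not in ("recompute", "fail"):
        raise ValueError(f"unsupported kv_load_failure_policy: {policy}")
    return {
        "kv_connector": "MultiConnector",
        "kv_role": kv_role,
        # The policy lives on the top-level config, not on child connectors.
        "kv_load_failure_policy": policy,
        "kv_connector_extra_config": {
            "connectors": [
                {"kv_connector": "MooncakeConnectorV1", "kv_role": kv_role},
                {
                    "kv_connector": "AscendStoreConnector",
                    "kv_role": kv_role,
                    "kv_connector_extra_config": {
                        "lookup_rpc_port": "0",
                        "backend": "mooncake",
                    },
                },
            ]
        },
    }


config = build_multi_connector_config("kv_producer")
# Print the JSON string that would be passed to --kv-transfer-config.
print(json.dumps(config, indent=2))
```

Generating the JSON from one function keeps the prefill and decode configurations in step, since only kv_role differs between them.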

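kv_connector_extra_config: additional configurable parameters for pooling

| Parameter | Description |
| --- | --- |
| lookup_rpc_port | Port used for RPC communication between the pooling scheduler process and the worker processes. Each instance must be configured with a unique port. |
| load_async | Whether to enable asynchronous loading. Defaults to false. |
| backend | Storage backend of the KV pool (mooncake, memcache, or yuanrong). Defaults to mooncake. |
| consumer_is_to_put | Whether decode nodes put KV cache into the KV pool. Defaults to false. |
| consumer_is_to_load | Whether decode nodes load KV cache from the KV pool. Defaults to false. |
| prefill_pp_size | Prefill PP size. Must be set when PP is enabled on the prefill nodes. |
| prefill_pp_layer_partition | Prefill PP layer partition. Must be set when PP is enabled on the prefill nodes. |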

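Environment variable configuration

To keep the generated hash values consistent, the PYTHONHASHSEED environment variable must be set to the same value on all nodes when the KV pool is enabled.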

export PYTHONHASHSEED=0
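The following sketch illustrates the general Python behavior behind this requirement: with a fixed PYTHONHASHSEED, separate interpreter processes agree on the built-in hash of a string. It does not reproduce vLLM's block hashing; the helper name and the sample key are illustrative only.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) computed in a new Python process with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# Two separate processes started with the same fixed seed agree on the value.
first = hash_in_fresh_interpreter("kv-block-key", "0")
second = hash_in_fresh_interpreter("kv-block-key", "0")
print(first == second)
```

Without a fixed seed, Python randomizes string hashing per process, which is why every node needs the same value.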

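Example: using Mooncake as the KV pool backend

  • Software requirements:

    • Check the NPU HCCN configuration:

      Make sure the hccn.conf file exists in the environment. If you use Docker, mount it into the container.

      cat /etc/hccn.conf

    • Install Mooncake

      Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and build guide: kvcache-ai/Mooncake. First, get the Mooncake project with the following command:

      git clone -b v0.3.9 --depth 1 https://github.com/kvcache-ai/Mooncake.git

      (Optional) If your network connection is poor, replace the Go download URL used by dependencies.sh:

      cd Mooncake
      sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh

      Install MPI:

      apt-get install mpich libmpich-dev -y

      Install the dependencies. Installing Go is not required.

      bash dependencies.sh -y

      Build and install:

      mkdir build
      cd build
      cmake .. -DUSE_ASCEND_DIRECT=ON
      make -j
      make install

      Set the environment variables.

      Note:

      • Adjust the Python path to match your own Python installation.

      • Make sure /usr/local/lib and /usr/local/lib64 are included in your LD_LIBRARY_PATH.

      export LD_LIBRARY_PATH=/usr/local/lib64/python3.12/site-packages/mooncake:$LD_LIBRARY_PATH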
      

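Environment variables

| Hardware | HDK and CANN versions | Export command | Description |
| --- | --- | --- | --- |
| 800 I/T A3 series | HDK >= 25.5; CANN >= 9.0.0; LingQu computing network >= 1.5 | export ASCEND_ENABLE_USE_FABRIC_MEM=1 | Recommended. Enables the direct transfer scheme based on unified memory addressing. |
| 800 I/T A3 series | 25.5.0 <= HDK < 26.0.0 | export ASCEND_BUFFER_POOL=4:8 | Sets the number and size of the buffers on the NPU device used for aggregation and KV transfer (for example, 4:8 means 4 buffers of 8 MB each). |
| 800 I/T A2 series | N/A | export HCCL_INTRA_ROCE_ENABLE=1 | Required by the direct transfer scheme on the 800 I/T A2 series. |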

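Embedded real client mode (Mooncake ssd-offload.md, Step 3A)

  • Software requirements:

    • mooncake >= v0.3.11

Start mooncake_master

mooncake_master --rpc_port=50051 --enable_offload=true

| Field | Description |
| --- | --- |
| enable_offload | Set to true to enable SSD offload. |

Configuration

Add the following fields to your mooncake.json: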

{
    "local_hostname": "xx.xx.xx.xx",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "use_ascend_direct": true,
    "device_name": "",
    "master_server_address": "xx.xx.xx.xx:50088",
    "global_segment_size": "1GB",
    "enable_ssd_offload": true,
    "ssd_offload_path": "/nvme/mooncake_offload"
}

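| Field | Description |
| --- | --- |
| enable_ssd_offload | Set to true to enable SSD offload. Environment variables are not supported. |
| ssd_offload_path | Required when enable_ssd_offload is true. Absolute path of the local directory where Mooncake stores offloaded KV data (for example, /nvme/mooncake_offload). The directory must exist and be writable by the vLLM process; create it before startup (mkdir -p <path>). Mooncake rejects relative paths, symlinks, and paths containing "..". The value is passed to MooncakeDistributedStore.setup() as the SSD storage root (equivalent to MOONCAKE_OFFLOAD_FILE_STORAGE_PATH in the standalone client). Configure this field only in mooncake.json; environment variables are not supported. |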
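The path constraints listed for ssd_offload_path can be checked before starting vLLM. The sketch below is a hypothetical pre-flight helper (check_offload_path is not part of Mooncake or vLLM) that mirrors the documented rules; it only inspects the final path component for symlinks, and Mooncake's own validation may be stricter.

```python
import os
import tempfile


def check_offload_path(path):
    """Return a list of problems with path; an empty list means it looks usable."""
    problems = []
    if not os.path.isabs(path):
        problems.append("path must be absolute")
    if ".." in path.split(os.sep):
        problems.append("path must not contain '..'")
    if os.path.islink(path):
        problems.append("path must not be a symlink")
    if not os.path.isdir(path):
        problems.append("directory does not exist")
    elif not os.access(path, os.W_OK):
        problems.append("directory is not writable")
    return problems


# Demonstration on a temporary directory standing in for /nvme/mooncake_offload.
with tempfile.TemporaryDirectory() as tmp:
    good = check_offload_path(os.path.realpath(tmp))
    bad = check_offload_path("relative/offload")
print(good, bad)
```

Run the check as the same user that launches vLLM, since writability depends on the calling process.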

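Run the embedded real client

In mode A (embedded real client), Mooncake is embedded in vLLM. When the vLLM service starts, AscendStoreConnector / MooncakeBackend automatically calls MooncakeDistributedStore.setup() with the settings in mooncake.json (including enable_ssd_offload and ssd_offload_path when SSD offload is enabled). No separate mooncake_client process is needed.

SSD disk usage control

The following environment variables control the disk space used by SSD offload (bucket backend):

| Environment variable | Default | Description |
| --- | --- | --- |
| MOONCAKE_OFFLOAD_BUCKET_MAX_TOTAL_SIZE | 0 | Eviction threshold in bytes. When set to 0, the backend uses 90% of the physical disk capacity as the quota. Set an explicit value to control disk usage precisely. |
| MOONCAKE_OFFLOAD_BUCKET_EVICTION_POLICY | none | Eviction policy: none (writes fail when the disk is full), fifo, or lru. |
| MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES | 2199023255552 (2 TB) | Global maximum disk usage limit. |

Each TP rank uses its own SSD subdirectory under ssd_offload_path (rank_0/, rank_1/, ...), but all ranks share the same physical disk. To keep a single rank from consuming too much space, set an explicit per-rank quota. For example, for an 800 GB disk and 8 TP ranks: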

# 800 GB total disk, 8 ranks, ~100 GB per rank
export MOONCAKE_OFFLOAD_BUCKET_MAX_TOTAL_SIZE=$((100 * 1024 * 1024 * 1024))
export MOONCAKE_OFFLOAD_BUCKET_EVICTION_POLICY=lru
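The per-rank quota above is a plain division of the disk budget by the number of TP ranks. A small sketch of that arithmetic, assuming the budget is expressed in GiB (the helper name is illustrative):

```python
def per_rank_quota_bytes(disk_gib, tp_ranks):
    """Split a disk budget given in GiB evenly across TP ranks, in bytes."""
    if tp_ranks <= 0:
        raise ValueError("tp_ranks must be positive")
    return (disk_gib * 1024**3) // tp_ranks


quota = per_rank_quota_bytes(800, 8)
print(f"export MOONCAKE_OFFLOAD_BUCKET_MAX_TOTAL_SIZE={quota}")
```

For 800 GiB and 8 ranks this yields the same value as the shell expression $((100 * 1024 * 1024 * 1024)). Budgeting less than the full disk leaves headroom for other files on the same device.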

Notes

  • This feature requires mooncake >= v0.3.11.

HIXL (ascend_direct) backend FAQ

For common troubleshooting and fault-locating guidance on HIXL (ascend_direct), see: https://gitcode.com/cann/hixl/wiki/HIXL常见问题定位手册.md

Run Mooncake Master

1. Configure mooncake.json

Set the MOONCAKE_CONFIG_PATH environment variable to the full path of mooncake.json.

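{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "master_server_address": "xx.xx.xx.xx:50088",
    "global_segment_size": "1GB",
    "preferred_segment": false,
    "prefer_alloc_in_same_node": true
}

  • metadata_server: set to P2PHANDSHAKE.

  • protocol: must be set to "ascend" on NPUs.

  • device_name: set to "".

  • master_server_address: IP address and port of the master service. It can also be set through the MOONCAKE_MASTER environment variable, which takes precedence over this field (useful when the master address is injected by Kubernetes).

  • global_segment_size: amount of memory that each card registers with the KV pool. It must be aligned to 1 GB. "1GB" can equivalently be written as 1024MB, 1048576KB, 1073741824B, or 1073741824. It can also be set through the MOONCAKE_GLOBAL_SEGMENT_SIZE environment variable, which takes precedence over this field.

  • preferred_segment: whether objects put into the KV pool are preferentially stored on the local segment. Defaults to false.

  • prefer_alloc_in_same_node: whether KV is preferentially allocated on the same node. Defaults to true.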
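To make the accepted global_segment_size spellings and the 1 GB alignment rule concrete, here is an illustrative converter. It is a sketch that mirrors the equivalent forms listed in this guide (1GB, 1024MB, 1048576KB, 1073741824B, 1073741824), not Mooncake's own parser; both function names are hypothetical.

```python
import re

_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_segment_size(value):
    """Convert a size such as "1GB", "1024MB" or 1073741824 to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)?\s*", str(value).upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or ""]


def is_gb_aligned(size_bytes):
    """True when the size is a positive multiple of 1 GB (1024**3 bytes)."""
    return size_bytes > 0 and size_bytes % 1024**3 == 0


for text in ("1GB", "1024MB", "1048576KB", "1073741824B", "1073741824"):
    size = parse_segment_size(text)
    print(text, size, is_gb_aligned(size))
```

A value such as "1500MB" parses correctly but fails the alignment check, which is the case the 1 GB rule is meant to exclude.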

2. Start mooncake_master

In the mooncake folder:

mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1 --default_kv_lease_ttl 11000

eviction_high_watermark_ratio sets the watermark at which Mooncake Store performs eviction, and eviction_ratio sets the proportion of stored objects that are evicted. default_kv_lease_ttl controls the default lease TTL of KV objects in milliseconds; configure it with --default_kv_lease_ttl and make it larger than both ASCEND_CONNECT_TIMEOUT and ASCEND_TRANSFER_TIMEOUT.
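The lease TTL rule can be expressed as a one-line check. The sketch below uses the values from this guide (a TTL of 11000 ms and the 10000 ms timeouts set in the launch scripts); the function name is illustrative.

```python
def lease_ttl_is_safe(lease_ttl_ms, connect_timeout_ms, transfer_timeout_ms):
    """True when the lease TTL exceeds both Ascend timeouts (all in ms)."""
    return lease_ttl_ms > max(connect_timeout_ms, transfer_timeout_ms)


# Values from this guide: --default_kv_lease_ttl 11000 and 10000 ms timeouts.
print(lease_ttl_is_safe(11000, 10000, 10000))
```

If either timeout is raised, for example to cover more decode cards, raise default_kv_lease_ttl along with it.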

PD disaggregation (Prefill-Decode Disaggregation) scenario

1. Run the prefill node and the decode node

Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 handles kv_transfer (KV transfer), while AscendStoreConnector serves as the prefix-cache node.

Prefill node:

bash multi_producer.sh

Contents of multi_producer.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1

#Minimum retransmission timeout of the RDMA, equals 4.096 μs * 2 ^ timeout.
#Needs to satisfy the equation: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retry for RDMA transfer.
#HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
export HCCL_RDMA_TIMEOUT=17

# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
# This ensures that even in the worst-case scenario—where all Decode cards simultaneously attempt to connect to the same Prefill card the connection will not time out.
export ASCEND_CONNECT_TIMEOUT=10000

# Unit: ms. The timeout for one-sided communication transfer is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039).
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 32768 \
    --block-size 128 \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config \
    '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_producer",
    "kv_load_failure_policy": "recompute",
    "kv_connector_extra_config": {
        "connectors": [
            {
                "kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_producer",
                "kv_port": "20001",
                "kv_connector_extra_config": {
                    "prefill": {
                        "dp_size": 1,
                        "tp_size": 1
                    },
                    "decode": {
                        "dp_size": 1,
                        "tp_size": 1
                    }
                }
            },
            {
                "kv_connector": "AscendStoreConnector",
                "kv_role": "kv_producer",
                "kv_connector_extra_config": {
                    "lookup_rpc_port":"0",
                    "backend": "mooncake"
                }
            }  
        ]
    }
    }'
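The timeout comments in multi_producer.sh can be verified numerically. The sketch below evaluates the stated formulas: the RDMA timeout of 4.096 μs × 2^timeout, the condition ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT × 7, and the recommended ASCEND_CONNECT_TIMEOUT of about 500 ms per decode card. The function names and the 16-card example are illustrative assumptions.

```python
def rdma_timeout_ms(exponent):
    """Minimum RDMA retransmission timeout: 4.096 us * 2**exponent, in ms."""
    return 4.096e-3 * 2**exponent


def transfer_timeout_ok(transfer_timeout_ms, exponent, retries=7):
    """Check ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * retries."""
    return transfer_timeout_ms > rdma_timeout_ms(exponent) * retries


def recommended_connect_timeout_ms(decode_cards, per_card_ms=500):
    """connection_time_per_card * total_number_of_Decode_cards."""
    return per_card_ms * decode_cards


# HCCL_RDMA_TIMEOUT=17 and ASCEND_TRANSFER_TIMEOUT=10000 from the script.
print(round(rdma_timeout_ms(17), 3))
print(transfer_timeout_ok(10000, 17))
# Hypothetical deployment with 16 decode cards.
print(recommended_connect_timeout_ms(16))
```

With HCCL_RDMA_TIMEOUT=17 the RDMA timeout is roughly 537 ms, so seven retries take under 4 seconds and the 10-second transfer timeout satisfies the condition.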

Decode node:

bash multi_consumer.sh

Contents of multi_consumer.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export PYTHONHASHSEED=0
export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8200 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 32768 \
    --block-size 128 \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config \
    '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_consumer",
    "kv_load_failure_policy": "recompute",
    "kv_connector_extra_config": {
        "connectors": [
        {
                "kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_consumer",
                "kv_port": "20002",
                "kv_connector_extra_config": {
                    "prefill": {
                        "dp_size": 1,
                        "tp_size": 1
                    },
                    "decode": {
                        "dp_size": 1,
                        "tp_size": 1
                    }
                }
            },
            {
                "kv_connector": "AscendStoreConnector",
                "kv_role": "kv_consumer",
                "kv_connector_extra_config": {
                    "lookup_rpc_port":"0",
                    "backend": "mooncake"
                }
            }
        ]
    }
    }'

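Currently, in the PD disaggregation scenario the KV pool by default stores only the KV cache generated by prefill nodes. For models that use MLA, decode nodes can now also store KV cache for prefill nodes to use; enable this by adding consumer_is_to_put: true to AscendStoreConnector. If pipeline parallelism (PP) is enabled on the prefill nodes, prefill_pp_size and prefill_pp_layer_partition must also be set. For example: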

{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_consumer",
    "kv_load_failure_policy": "recompute",
    "kv_connector_extra_config": {
        "lookup_rpc_port": "0",
        "backend": "mooncake",
        "consumer_is_to_put": true,
        "prefill_pp_size": 2,
        "prefill_pp_layer_partition": "30,31"
    }
}
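A quick consistency check between prefill_pp_size and prefill_pp_layer_partition can catch mismatched settings early. This sketch assumes the partition string is a comma-separated list of per-stage layer counts, as the example value "30,31" with a PP size of 2 suggests; that interpretation and the helper name are assumptions, not confirmed connector behavior.

```python
def parse_layer_partition(partition, pp_size):
    """Parse a value like "30,31" and check it has one entry per PP stage."""
    layers = [int(part) for part in partition.split(",")]
    if len(layers) != pp_size:
        raise ValueError(f"expected {pp_size} entries, got {len(layers)}")
    if any(count <= 0 for count in layers):
        raise ValueError("each stage needs at least one layer")
    return layers


layers = parse_layer_partition("30,31", 2)
# Under the assumption above, the sum is the model's total layer count.
print(layers, sum(layers))
```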

2. Start the proxy server

python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
    --host localhost \
    --prefiller-hosts localhost \
    --prefiller-ports 8100 \
    --decoder-hosts localhost \
    --decoder-ports 8200

Replace localhost with your actual IP address.

3. Run inference

In the commands, set localhost, the port, and the model weight path to match your own setup.

Short prompt:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'

Long prompt:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'

PD colocated (mixed) inference

1. Run the colocated script

bash pd_mix.sh

Contents of pd_mix.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export PYTHONHASHSEED=0
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 32768 \
    --block-size 128 \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config \
    '{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_both",
    "kv_load_failure_policy": "recompute",
    "kv_connector_extra_config": {
        "lookup_rpc_port":"1",
        "backend": "mooncake"
    }
}' > mix.log 2>&1

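2. Run inference

In the commands, set localhost, the port, and the model weight path to match your own setup. Requests go directly to the port that the colocated script listens on; no separate proxy is needed.

Short prompt: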

curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'

Long prompt:

curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'

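Note: for MooncakeStore with ASCEND_BUFFER_POOL enabled, a warm-up phase is recommended before actual performance benchmarking.

This is because, when device-to-device communication is involved, HCCL one-sided communication connections are created lazily after the instance starts. Currently, a full mesh of connections must be established among all devices. Establishing these connections incurs a one-time latency cost and ongoing device memory consumption (4 MB of device memory per connection).

For warm-up, send requests with an input sequence length of 8K and an output sequence length of 1; the total number of requests should be 2 to 3 times the number of devices (cards/chips).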
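As a rough sketch of the warm-up recommendation, the snippet below builds completion request payloads with a long prompt and a single output token, 2 to 3 per device. It only constructs the payloads and does not send them. The prompt construction (repeating a short word to approximate 8K tokens) is a crude assumption; use a tokenizer to hit an exact length. The function names are illustrative.

```python
def warmup_request_count(num_devices, factor=3):
    """Number of warm-up requests: 2 to 3 times the device count."""
    if factor not in (2, 3):
        raise ValueError("the guide recommends a factor of 2 or 3")
    return num_devices * factor


def build_warmup_payloads(model, num_devices, input_tokens=8192, factor=3):
    """Build completion payloads with a long prompt and one output token."""
    # Rough stand-in: repeating a short word approximates one token per word.
    body = " ".join(["hello"] * input_tokens)
    payloads = []
    for index in range(warmup_request_count(num_devices, factor)):
        # A per-request prefix keeps the prompts distinct from each other.
        prompt = f"warmup request {index}: {body}"
        payloads.append(
            {"model": model, "prompt": prompt, "max_tokens": 1, "temperature": 0.0}
        )
    return payloads


payloads = build_warmup_payloads("/xxxxx/Qwen2.5-7B-Instruct", num_devices=8)
print(len(payloads))
```

Each payload can then be posted to the /v1/completions endpoint of the instance under test before measurements begin.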

Example: using Memcache as the KV pool backend

Install Memcache

MemCache depends on MemFabric, so install MemFabric first and then install MemCache.

pip install memfabric-hybrid
pip install memcache-hybrid

Configure the Memcache configuration files

mmc-meta.conf:

ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
ock.mmc.meta_service.config_store_url = tcp://xx.xx.xx.xx:6000
ock.mmc.log_level = error

mmc-local.conf:

ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
ock.mmc.local_service.config_store_url = tcp://xx.xx.xx.xx:6000
ock.mmc.log_level = error
ock.mmc.local_service.world_size = 256
ock.mmc.local_service.protocol = device_sdma
ock.mmc.local_service.dram.size = 1GB

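Key points:

| Parameter | Description |
| --- | --- |
| ock.mmc.meta_service_url | IP address and port of the master node. Prefill (P) and decode (D) nodes can use the same IP address and port. |
| ock.mmc.local_service.config_store_url | IP address and port of the master node. Prefill (P) and decode (D) nodes can use the same IP address and port. |
| ock.mmc.local_service.world_size | Total number of local services, including services that will be added in the future. |
| ock.mmc.local_service.protocol | device_rdma (supported on A2 and A3 when device RoCE is available; recommended for A2) or device_sdma (supported on A3 when HCCS is available; recommended for A3). Heterogeneous protocol settings are currently not supported. |
| ock.mmc.local_service.dram.size | Memory size occupied by the master node. The configured value is the memory size occupied per card. |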

Run Memcache Master

Start the MetaService service.

export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf

python -c "from memcache_hybrid import MetaService; MetaService.main()"

PD disaggregation (Prefill-Decode Disaggregation) scenario

1. Run the prefill node and the decode node

Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 performs kv_transfer, while AscendStoreConnector enables the KV cache pool.

800I A2/800T A2/800I A3/800T A3 series

run_prefill.sh/run_decode.sh:

#!/bin/bash

ROLE="prefill"              # prefill / decode
HARDWARE_SERIES="A2"        # A2 (800I/800T A2) or A3 (800I/800T A3)
LOCAL_IP="xx.xx.xx.xx"
NIC_NAME="xxxxxx"

MODEL_PATH="xxxxxxx/Qwen3-32B"
SERVED_MODEL_NAME="qwen3"
DATA_PARALLEL_SIZE=1
TENSOR_PARALLEL_SIZE=8
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf

if [ "$ROLE" == "prefill" ]; then
    KV_ROLE="kv_producer"
    KV_PORT="20001"
    LOOKUP_RPC_PORT="0"
else
    KV_ROLE="kv_consumer"
    KV_PORT="20002"
    LOOKUP_RPC_PORT="1"
fi

echo "Starting vLLM on Series: $HARDWARE_SERIES, Role: $ROLE"

rm -rf /root/ascend/log/*
rm -rf ./connector.log

if [ "$HARDWARE_SERIES" == "A2" ]; then
    echo 200000 > /proc/sys/vm/nr_hugepages
    export HCCL_IF_IP=$LOCAL_IP
    export GLOO_SOCKET_IFNAME=$NIC_NAME
    export TP_SOCKET_IFNAME=$NIC_NAME
    export HCCL_SOCKET_IFNAME=$NIC_NAME

elif [ "$HARDWARE_SERIES" == "A3" ]; then
    export ACL_OP_INIT_MODE=1
else
    echo "Error: Invalid HARDWARE_SERIES. Set to 'A2' or 'A3'."
    exit 1
fi

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1

KV_CONFIG='{
  "kv_connector": "MultiConnector",
  "kv_role": "'$KV_ROLE'",
  "kv_connector_extra_config": {
    "connectors": [
      {
        "kv_connector": "MooncakeConnectorV1",
        "kv_role": "'$KV_ROLE'",
        "kv_port": "'$KV_PORT'",
        "kv_connector_extra_config": {
          "prefill": {
            "dp_size": '$DATA_PARALLEL_SIZE',
            "tp_size": '$TENSOR_PARALLEL_SIZE'
          },
          "decode": {
            "dp_size": '$DATA_PARALLEL_SIZE',
            "tp_size": '$TENSOR_PARALLEL_SIZE'
          }
        }
      },
      {
        "kv_connector": "AscendStoreConnector",
        "kv_role": "'$KV_ROLE'",
        "kv_connector_extra_config": {
          "backend": "memcache",
          "lookup_rpc_port": "'$LOOKUP_RPC_PORT'"
        }
      }
    ]
  }
}'

CMD_ARGS=(
  --model "$MODEL_PATH"
  --served-model-name "$SERVED_MODEL_NAME"
  --trust-remote-code
  --enforce-eager
  --data-parallel-size "$DATA_PARALLEL_SIZE"
  --tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
  --port 30050
  --max-num-seqs 20
  --max-model-len 32768
  --max-num-batched-tokens 16384
  --gpu-memory-utilization 0.9
  --kv-transfer-config "$KV_CONFIG"
)

python -m vllm.entrypoints.openai.api_server "${CMD_ARGS[@]}" > log_${ROLE}.log 2>&1

# The server runs in the foreground, so this line is reached only after it exits.
echo "vLLM exited. Log file: log_${ROLE}.log"

2. Start the proxy server

3. Run inference

PD colocated (mixed) scenario

1. Run the colocated script

800I A2/800T A2/800I A3/800T A3 series

Run_pd_mix.sh:

#!/bin/bash

HARDWARE_SERIES="A2"        # A2 (800I/800T A2) or A3 (800I/800T A3)
LOCAL_IP="xx.xx.xx.xx"
NIC_NAME="xxxxxx"

MODEL_PATH="xxxxxxx/Qwen3-32B"
SERVED_MODEL_NAME="qwen3"
DATA_PARALLEL_SIZE=1
TENSOR_PARALLEL_SIZE=8
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf

echo "Starting vLLM on Series: $HARDWARE_SERIES"

rm -rf /root/ascend/log/*
rm -rf ./connector.log

if [ "$HARDWARE_SERIES" == "A2" ]; then
    echo 200000 > /proc/sys/vm/nr_hugepages
    export HCCL_IF_IP=$LOCAL_IP
    export GLOO_SOCKET_IFNAME=$NIC_NAME
    export TP_SOCKET_IFNAME=$NIC_NAME
    export HCCL_SOCKET_IFNAME=$NIC_NAME

elif [ "$HARDWARE_SERIES" == "A3" ]; then
    export ACL_OP_INIT_MODE=1
else
    echo "Error: Invalid HARDWARE_SERIES. Set to 'A2' or 'A3'."
    exit 1
fi

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1

KV_CONFIG='{
  "kv_connector": "AscendStoreConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
     "backend": "memcache",
     "lookup_rpc_port": "0"
  }
}'

CMD_ARGS=(
  --model "$MODEL_PATH"
  --served-model-name "$SERVED_MODEL_NAME"
  --trust-remote-code
  --enforce-eager
  --data-parallel-size "$DATA_PARALLEL_SIZE"
  --tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
  --port 30050
  --max-num-seqs 20
  --max-model-len 32768
  --max-num-batched-tokens 16384
  --gpu-memory-utilization 0.9
  --kv-transfer-config "$KV_CONFIG"
)

python -m vllm.entrypoints.openai.api_server "${CMD_ARGS[@]}" > log_mix.log 2>&1

# The server runs in the foreground, so this line is reached only after it exits.
echo "vLLM exited. Log file: log_mix.log"

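2. Run inference

Example: using Yuanrong as the KV pool backend

  • Software requirements:

    • Install openyuanrong-datasystem on all nodes (yr.datasystem must be importable).

Install the Yuanrong datasystem

pip install openyuanrong-datasystem

If the prebuilt package does not match the CANN or Ascend driver version in your environment, build the Yuanrong datasystem from source inside the vLLM Ascend image. Follow the official Yuanrong datasystem build instructions: https://atomgit.com/openeuler/yuanrong-datasystem

Start etcd

The Yuanrong datasystem uses etcd for service discovery. The following example starts a single-node etcd cluster: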

ETCD_VERSION="v3.5.12"
ETCD_IP="127.0.0.1"
if [ "$(uname -m)" = "aarch64" ]; then
  ETCD_ARCH="linux-arm64"
else
  ETCD_ARCH="linux-amd64"
fi
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-${ETCD_ARCH}.tar.gz
tar -xvf etcd-${ETCD_VERSION}-${ETCD_ARCH}.tar.gz
cd etcd-${ETCD_VERSION}-${ETCD_ARCH}
sudo cp etcd etcdctl /usr/local/bin/

etcd \
  --name etcd-single \
  --data-dir /tmp/etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://${ETCD_IP}:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://${ETCD_IP}:2380 \
  --initial-cluster etcd-single=http://${ETCD_IP}:2380 &

etcdctl --endpoints "${ETCD_IP}:2379" put key "value"
etcdctl --endpoints "${ETCD_IP}:2379" get key

For production environments, see the official etcd clustering documentation: https://etcd.io/docs/v3.7/op-guide/clustering/

Start the datasystem worker

Use dscli to start one datasystem worker on each node:

dscli start -w \
  --worker_address "${WORKER_IP}:31501" \
  --etcd_address "${ETCD_IP}:2379" \
  --shared_memory_size_mb 40960 \
  --enable_worker_worker_batch_get=true

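The --worker_address value is used later by DS_WORKER_ADDR, so make sure the host and port stay consistent on the same node.

For more parameters, see the dscli usage documentation on the official Yuanrong datasystem site: https://atomgit.com/openeuler/yuanrong-datasystem

To stop the worker: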

dscli stop --worker_address "${WORKER_IP}:31501"

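Environment variable configuration

Before starting vLLM, set the following environment variables on every node:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| PYTHONHASHSEED | | 0 | Must be identical on all nodes so that hashes are generated consistently. |
| DS_WORKER_ADDR | | N/A | Address of the datasystem worker, in the form <host>:<port>. It must match the value of the local dscli start --worker_address. |
| DS_ENABLE_EXCLUSIVE_CONNECTION | | 0 | Passed to Yuanrong HeteroClient.enable_exclusive_connection. Use 1 to enable exclusive connection mode when your deployment needs it. |
| DS_ENABLE_REMOTE_H2D | | 0 | Passed to Yuanrong HeteroClient.enable_remote_h2d. Use 1 only after the remote H2D requirements below are met. |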

export PYTHONHASHSEED=0
export DS_WORKER_ADDR="${WORKER_IP}:31501"
export DS_ENABLE_EXCLUSIVE_CONNECTION=0
export DS_ENABLE_REMOTE_H2D=0

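Remote H2D requirements

Set DS_ENABLE_REMOTE_H2D=1 only after remote host-to-device transfer has been enabled and verified in the Yuanrong datasystem deployment.

  • Reserve enough 2 MiB HugeTLB pages before starting the worker. For 40 GiB of shared memory, reserve at least 20480 2 MiB huge pages.

  • Start every datasystem worker with remote H2D enabled. The worker start command must include --remote_h2d_device_ids, --enable_huge_tlb true, --arena_per_tenant 1, and --enable_fallocate false. Using multiple available NPU device IDs is recommended, for example "0,1,2,3,4,5,6,7" on an 8-NPU node.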

dscli start -w \
  --worker_address "${WORKER_IP}:31501" \
  --etcd_address "${ETCD_IP}:2379" \
  --shared_memory_size_mb 40960 \
  --arena_per_tenant 1 \
  --enable_huge_tlb true \
  --enable_fallocate false \
  --remote_h2d_device_ids "0,1,2,3,4,5,6,7" \
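  --enable_worker_worker_batch_get=true

  • Make sure the NPU driver, firmware, and CANN toolkit required by Yuanrong remote H2D are installed and visible to the worker process. In containers, mount the Ascend driver path, npu-smi, hccn_tool, /etc/hccn.conf, /etc/ascend_install.info, and the required /dev/davinci* devices.

  • Verify the NPU and RoCE environment before enabling the client flag: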

# Check the current 2 MiB HugeTLB page size, total count, and free count.
grep -E "HugePages_Total|HugePages_Free|Hugepagesize" /proc/meminfo

# Optional: check 2 MiB HugeTLB pages on each NUMA node.
for node in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
  echo "$node total=$(cat "$node/nr_hugepages") free=$(cat "$node/free_hugepages")"
done

# Check that NPU devices and the driver are visible to the worker environment.
npu-smi info

# Check that the NPU topology is visible.
npu-smi info -t topo

# Check optical module detection on the selected local NPU.
hccn_tool -i <local_npu_id> -optical -g

# Check RoCE physical link status. The expected link status is UP.
for i in {0..7}; do hccn_tool -i $i -link -g; done

# Check the selected NPU IP address and reachability to the remote NPU.
hccn_tool -i <local_npu_id> -ip -g
hccn_tool -i <local_npu_id> -ping -g address <remote_npu_ip>

If these checks fail, keep DS_ENABLE_REMOTE_H2D=0 and use the default datasystem transfer path.
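The huge page reservation follows directly from the shared memory size: 40960 MB backed by 2 MiB pages needs 20480 pages. The sketch below computes that number and compares it with counters parsed from /proc/meminfo-style text; the sample text and function names are illustrative.

```python
import math


def required_hugepages(shared_memory_mb, page_size_mb=2):
    """Minimum number of HugeTLB pages needed to back the shared memory."""
    return math.ceil(shared_memory_mb / page_size_mb)


def read_hugepage_counters(meminfo_text):
    """Extract HugePages_Total and HugePages_Free from /proc/meminfo text."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("HugePages_Total", "HugePages_Free"):
            counters[key] = int(rest.split()[0])
    return counters


# --shared_memory_size_mb 40960 with 2 MiB pages.
needed = required_hugepages(40960)
print(needed)

# Sample text in the /proc/meminfo format; read the real file on the node.
sample = "HugePages_Total:   20480\nHugePages_Free:    20480\nHugepagesize:       2048 kB\n"
print(read_hugepage_counters(sample)["HugePages_Free"] >= needed)
```

On a real node, pass the contents of /proc/meminfo to read_hugepage_counters and confirm that the free count covers the requirement before starting the worker.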

Run AscendStoreConnector with the Yuanrong backend

Use AscendStoreConnector and set backend: "yuanrong".

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 10000 \
    --block-size 128 \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_both",
    "kv_load_failure_policy": "recompute",
    "kv_connector_extra_config": {
        "lookup_rpc_port": "1",
        "backend": "yuanrong"
    }
}'

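lookup_rpc_port is the RPC port used between the pooling scheduler process and the worker processes. Each instance must use a unique port value.

Notes

  • The Yuanrong backend normalizes KV keys before calling the datasystem. Keys that are longer than 255 characters or contain unsupported characters are rewritten, so do not rely on the original key strings when debugging backend storage.

  • Yuanrong needs no additional buffer pre-registration step. The backend uses device pointers directly when building the blob list.

2. Run inference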