Ascend Store Deployment Guide#

Environment Dependencies#

  • Software:

    • CANN >= 8.5.0

    • vLLM: main branch

    • vLLM-Ascend: main branch

    • Mooncake: >= 0.3.9

KV Pool Parameter Description#

kv_connector_extra_config: Additional Configurable Parameters for Pooling#

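| Parameter | Description |
| --- | --- |
| lookup_rpc_port | RPC port used between the pooling scheduler process and the worker processes. Each instance must be configured with a unique port. |
| load_async | Whether to enable asynchronous loading. Defaults to false. |
| backend | Storage backend of the KV Pool. Defaults to mooncake. |
| consumer_is_to_put | Whether the Decode node puts KV cache into the KV Pool. Defaults to false. |
| consumer_is_to_load | Whether the Decode node loads KV cache from the KV Pool. Defaults to false. |
| prefill_pp_size | Prefill PP size. Must be set when the Prefill node enables PP. |
| prefill_pp_layer_partition | Prefill PP layer partition. Must be set when the Prefill node enables PP. |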
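As a minimal sketch of how these parameters fit together, the following Python helper assembles an AscendStoreConnector entry of the shape used in the examples later in this guide. The helper name build_store_connector_config and its validation are illustrative assumptions and are not part of vLLM-Ascend; only the key names come from the parameter list above.

```python
import json

# Optional keys taken from the parameter list above.
OPTIONAL_KEYS = {
    "load_async",
    "consumer_is_to_put",
    "consumer_is_to_load",
    "prefill_pp_size",
    "prefill_pp_layer_partition",
}


def build_store_connector_config(kv_role, lookup_rpc_port, backend="mooncake", **extra):
    """Return an AscendStoreConnector entry (hypothetical helper, for illustration)."""
    unknown = set(extra) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown extra config keys: {sorted(unknown)}")
    extra_config = {"lookup_rpc_port": str(lookup_rpc_port), "backend": backend}
    extra_config.update(extra)
    return {
        "kv_connector": "AscendStoreConnector",
        "kv_role": kv_role,
        "kv_connector_extra_config": extra_config,
    }


if __name__ == "__main__":
    # Print a JSON string that could be passed to --kv-transfer-config.
    print(json.dumps(build_store_connector_config("kv_both", 1), indent=4))
```

Generating the JSON this way keeps the quoting consistent and rejects misspelled keys before the server is started.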

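Environment Variable Configuration#

To keep hash generation consistent, the PYTHONHASHSEED environment variable must be set to the same value on all nodes when the KV Pool is enabled.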

export PYTHONHASHSEED=0
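The effect of the seed can be observed with a short Python sketch. It starts fresh interpreters with a chosen PYTHONHASHSEED and compares the built-in hash() of the same string. The helper name and the sample string are illustrative assumptions, and the sketch does not claim to reproduce how the KV Pool actually derives its keys.

```python
import os
import subprocess
import sys


def string_hash_with_seed(seed: str, text: str = "kv-block") -> int:
    """Return hash(text) as computed by a fresh interpreter run with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Two processes with the same seed agree on the hash of the same string.
    print(string_hash_with_seed("0") == string_hash_with_seed("0"))
    # With different seeds the hashes generally differ.
    print(string_hash_with_seed("0") == string_hash_with_seed("1"))
```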

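Example of Using Mooncake as the KV Pool Backend#

  • Software:

    • Check the NPU HCCN configuration:

      Ensure that the hccn.conf file exists in the environment. If you use Docker, mount it into the container.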

      cat /etc/hccn.conf
      
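    • Install Mooncake

      Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: kvcache-ai/Mooncake. First, obtain the Mooncake project with the following command: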

      git clone -b v0.3.9 --depth 1 https://github.com/kvcache-ai/Mooncake.git
      

      (Optional) If the network connection is poor, replace the Go download URL

      cd Mooncake
      sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
      

      Install MPI

      apt-get install mpich libmpich-dev -y
      

      Install the dependencies. Go does not need to be installed.

      bash dependencies.sh -y
      

      Compile and install

      mkdir build
      cd build
      cmake .. -DUSE_ASCEND_DIRECT=ON
      make -j
      make install
      

      Set environment variables

      Note:

      • Adjust the Python path according to your specific Python installation

      • Ensure that /usr/local/lib and /usr/local/lib64 are in your LD_LIBRARY_PATH

      export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
      

Environment Variable Description#

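| Hardware | HDK and CANN versions | Export command | Description |
| --- | --- | --- | --- |
| 800 I/T A3 series | HDK >= 26.0.0, CANN >= 9.0.0 | export ASCEND_ENABLE_USE_FABRIC_MEM=1 | Recommended. Enables the unified-memory-address direct transfer scheme. |
| 800 I/T A3 series | 25.5.0 <= HDK < 26.0.0 | export ASCEND_BUFFER_POOL=4:8 | Configures the number and size of the buffers on the NPU device used for aggregation and KV transfer (for example, 4:8 means 4 buffers of 8 MB each). |
| 800 I/T A2 series | N/A | export HCCL_INTRA_ROCE_ENABLE=1 | Required by the direct transfer scheme on the 800 I/T A2 series. |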
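To make the ASCEND_BUFFER_POOL format concrete, here is a small Python sketch that reads a value of the form count:size_in_MB and reports the total buffer memory it implies. The function names are illustrative assumptions; only the meaning of 4:8 (4 buffers of 8 MB) comes from the description above.

```python
def parse_buffer_pool(value: str) -> tuple[int, int]:
    """Split 'count:size_mb' into (buffer count, size of each buffer in MB)."""
    count_text, size_text = value.split(":")
    count, size_mb = int(count_text), int(size_text)
    if count <= 0 or size_mb <= 0:
        raise ValueError("buffer count and size must both be positive")
    return count, size_mb


def total_buffer_mb(value: str) -> int:
    """Total device memory, in MB, reserved by the buffer pool setting."""
    count, size_mb = parse_buffer_pool(value)
    return count * size_mb


if __name__ == "__main__":
    # 4 buffers of 8 MB each.
    print(parse_buffer_pool("4:8"), total_buffer_mb("4:8"))
```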

HIXL (ascend_direct) Backend FAQ#

For common troubleshooting and problem-location guidance for HIXL (ascend_direct), see: https://gitcode.com/cann/hixl/wiki/HIXL常见问题定位手册.md

Run Mooncake Master#

1. Configure mooncake.json#

Set the MOONCAKE_CONFIG_PATH environment variable to the full path of mooncake.json.

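{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "master_server_address": "xx.xx.xx.xx:50088",
    "global_segment_size": "1GB"
}

  • metadata_server: set to P2PHANDSHAKE.

  • protocol: must be set to "ascend" on NPUs, as in the example above.

  • device_name: leave as an empty string ("").

  • master_server_address: the IP address and port of the master service.

  • global_segment_size: the amount of memory each card registers with the KV Pool. It must be aligned to 1GB. The value "1GB" can equivalently be written as 1024MB, 1048576KB, 1073741824B, or 1073741824.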
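The 1GB alignment rule for global_segment_size can be checked before the service is started. The following Python sketch parses the spellings listed in the example (1GB, 1024MB, 1048576KB, 1073741824B, 1073741824) and tests the alignment. The function names are hypothetical, and the binary unit multipliers are inferred from the equivalences shown in the example rather than from Mooncake's own parser.

```python
import re

GIB = 1024 ** 3
# Multipliers inferred from the equivalence 1GB = 1024MB = 1048576KB = 1073741824B.
UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": GIB}


def segment_size_bytes(text: str) -> int:
    """Convert a size string such as '1GB' or '1073741824' to bytes."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]*)", text.strip())
    if match is None or match.group(2).upper() not in UNITS:
        raise ValueError(f"unsupported size string: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2).upper()]


def is_gib_aligned(text: str) -> bool:
    """True when the size is a positive multiple of 1 GB."""
    size = segment_size_bytes(text)
    return size > 0 and size % GIB == 0


if __name__ == "__main__":
    for candidate in ["1GB", "1024MB", "1536MB"]:
        print(candidate, segment_size_bytes(candidate), is_gib_aligned(candidate))
```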

2. Start mooncake_master#

From the mooncake folder:

mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.1 --default_kv_lease_ttl 11000

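eviction_high_watermark_ratio sets the watermark at which Mooncake Store starts eviction, and eviction_ratio sets the fraction of stored objects that will be evicted. default_kv_lease_ttl controls the default lease TTL of KV objects, in milliseconds. Configure it with --default_kv_lease_ttl and keep it greater than both ASCEND_CONNECT_TIMEOUT and ASCEND_TRANSFER_TIMEOUT.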
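The TTL constraint can be expressed as a one-line check. This is a minimal Python sketch with a hypothetical function name; all values are in milliseconds, and the sample numbers are the ones used in this guide (a lease TTL of 11000 with both timeouts at 10000).

```python
def lease_ttl_is_safe(lease_ttl_ms: int, connect_timeout_ms: int, transfer_timeout_ms: int) -> bool:
    """True when the KV lease outlives both one-sided communication timeouts."""
    return lease_ttl_ms > max(connect_timeout_ms, transfer_timeout_ms)


if __name__ == "__main__":
    # Values from this guide: --default_kv_lease_ttl 11000, both timeouts 10000.
    print(lease_ttl_is_safe(11000, 10000, 10000))
```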

PD Disaggregation Scenario#

1. Run the prefill node and decode node#

Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 performs kv_transfer, while AscendStoreConnector serves as the prefix-cache node.

Prefill node:

bash multi_producer.sh

Contents of multi_producer.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1

# Minimum RDMA retransmission timeout, equal to 4.096 μs * 2 ^ timeout.
# It must satisfy ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retries for an RDMA transfer.
# HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
export HCCL_RDMA_TIMEOUT=17

# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
# This ensures that even in the worst case, where all Decode cards simultaneously attempt to connect to the same Prefill card, the connection will not time out.
export ASCEND_CONNECT_TIMEOUT=10000

# Unit: ms. The timeout for one-sided communication transfer is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039).
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 32768 \
    --block-size 128 \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config \
    '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_producer",
    "kv_connector_extra_config": {
        "connectors": [
            {
                "kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_producer",
                "kv_port": "20001",
                "kv_connector_extra_config": {
                    "prefill": {
                        "dp_size": 1,
                        "tp_size": 1
                    },
                    "decode": {
                        "dp_size": 1,
                        "tp_size": 1
                    }
                }
            },
            {
                "kv_connector": "AscendStoreConnector",
                "kv_role": "kv_producer",
                "kv_connector_extra_config": {
                    "lookup_rpc_port":"0",
                    "backend": "mooncake"
                }
            }  
        ]
    }
    }'
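The timeout comments in the script can be worked through numerically. With HCCL_RDMA_TIMEOUT=17, the RDMA timeout is 4.096 μs × 2^17 = 536870.912 μs, or about 536.9 ms. Seven retries take about 3758.1 ms, which is below the configured ASCEND_TRANSFER_TIMEOUT of 10000 ms. The Python sketch below repeats this arithmetic; the function names are illustrative assumptions, and the 500 ms per-card connection time is the upper bound quoted in the script comments.

```python
RDMA_BASE_US = 4.096       # base unit from the script comment, in microseconds
DEFAULT_RDMA_RETRIES = 7   # default retry count from the script comment


def rdma_timeout_ms(exponent: int) -> float:
    """RDMA retransmission timeout in ms for a given HCCL_RDMA_TIMEOUT value."""
    return RDMA_BASE_US * (2 ** exponent) / 1000.0


def min_transfer_timeout_ms(exponent: int, retries: int = DEFAULT_RDMA_RETRIES) -> float:
    """Lower bound that ASCEND_TRANSFER_TIMEOUT must exceed."""
    return rdma_timeout_ms(exponent) * retries


def recommended_connect_timeout_ms(decode_cards: int, per_card_ms: int = 500) -> int:
    """ASCEND_CONNECT_TIMEOUT per the formula: per-card time x number of Decode cards."""
    return per_card_ms * decode_cards


if __name__ == "__main__":
    print(rdma_timeout_ms(17), min_transfer_timeout_ms(17))
    print(recommended_connect_timeout_ms(20))
```

Under these assumptions, the default ASCEND_CONNECT_TIMEOUT of 10000 ms corresponds to 20 Decode cards at 500 ms each.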

Decode node:

bash multi_consumer.sh

Contents of multi_consumer.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export PYTHONHASHSEED=0
export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8200 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 32768 \
    --block-size 128 \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config \
    '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_consumer",
    "kv_connector_extra_config": {
        "connectors": [
        {
                "kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_consumer",
                "kv_port": "20002",
                "kv_connector_extra_config": {
                    "prefill": {
                        "dp_size": 1,
                        "tp_size": 1
                    },
                    "decode": {
                        "dp_size": 1,
                        "tp_size": 1
                    }
                }
            },
            {
                "kv_connector": "AscendStoreConnector",
                "kv_role": "kv_consumer",
                "kv_connector_extra_config": {
                    "lookup_rpc_port":"1",
                    "backend": "mooncake"
                }
            }
        ]
    }
    }'

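Currently, the KV pool in PD disaggregation by default stores only the KV cache generated by the Prefill node. For models that use MLA, the Decode node can now also store KV cache for the Prefill node to use. Enable this by adding consumer_is_to_put: true to the AscendStoreConnector. If the Prefill node enables PP, prefill_pp_size and prefill_pp_layer_partition must also be set. For example: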

{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_consumer",
    "kv_connector_extra_config": {
        "lookup_rpc_port": "1",
        "backend": "mooncake",
        "consumer_is_to_put": true,
        "prefill_pp_size": 2,
        "prefill_pp_layer_partition": "30,31"
    }
}
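A quick consistency check between prefill_pp_size and prefill_pp_layer_partition can catch mismatched settings early. The Python sketch below assumes that the partition string lists one positive layer count per PP stage, which matches the example ("30,31" with a PP size of 2) but is not confirmed by this guide. The function name is hypothetical.

```python
def parse_pp_layer_partition(partition: str, pp_size: int) -> list[int]:
    """Parse a comma-separated partition and check it has one entry per PP stage.

    Assumption: each entry is the number of layers assigned to one stage.
    """
    layers = [int(part.strip()) for part in partition.split(",")]
    if len(layers) != pp_size:
        raise ValueError(f"expected {pp_size} entries, got {len(layers)}")
    if any(count <= 0 for count in layers):
        raise ValueError("every stage needs a positive layer count")
    return layers


if __name__ == "__main__":
    stages = parse_pp_layer_partition("30,31", 2)
    print(stages, sum(stages))
```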

2. Start proxy_server#

python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
    --host localhost \
    --prefiller-hosts localhost \
    --prefiller-ports 8100 \
    --decoder-hosts localhost \
    --decoder-ports 8200

Change localhost to your actual IP address.

3. Run inference#

Replace localhost, the port, and the model weight path in the commands with your own settings.

Short question:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'

Long question:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
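The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the fields of the curl commands above. The function names are illustrative assumptions, and the base URL and model path are the placeholders used in this guide, so replace them with your own.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_completion_tokens=200):
    """Build a POST request with the same JSON fields as the curl commands above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_completion_tokens": max_completion_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_completion_request(
        "http://localhost:8000",
        "/xxxxx/Qwen2.5-7B-Instruct",
        "Hello. I have a question. The president of the United States is",
    )
    # Requires the proxy server from the previous step to be running.
    print(send(req))
```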

PD-Mixed Inference#

1. Run the mixed deployment script#

bash pd_mix.sh

Contents of pd_mix.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export PYTHONHASHSEED=0
export ACL_OP_INIT_MODE=1
#A3
export ASCEND_ENABLE_USE_FABRIC_MEM=1
#A2
#export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no-enable-prefix-caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 32768 \
    --block-size 128 \
    --max-num-batched-tokens 16384 \
    --kv-transfer-config \
    '{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "lookup_rpc_port":"1",
        "backend": "mooncake"
    }
}' > mix.log 2>&1

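2. Run inference#

Replace localhost, the port, and the model weight path in the commands with your own settings. Requests go only to the port on which the mixed deployment script listens, so no separate proxy needs to be started.

Short question: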

curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'

Long question:

curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'

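Note: for MooncakeStore with ASCEND_BUFFER_POOL enabled, a warm-up phase is recommended before running actual performance benchmarks.

This is because, when device-to-device communication is involved, HCCL one-sided communication connections are created lazily after the instance starts. Currently, a full mesh of connections must be established among all devices. Establishing these connections introduces a one-time latency overhead and persistent device memory consumption (each connection consumes 4 MB of device memory).

For warm-up, send requests with an input sequence length of 8K and an output sequence length of 1. The total number of requests should be 2 to 3 times the number of devices (cards/dies).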
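The warm-up sizing can be sketched in a few lines of Python. The function names are hypothetical. The request body reuses the fields of the curl commands in this guide with the output length set to 1. Producing a prompt of exactly 8K tokens depends on the model tokenizer, so the prompt is left to the caller.

```python
def warmup_request_counts(device_count: int) -> tuple[int, int]:
    """Return the (minimum, maximum) number of warm-up requests: 2x to 3x the device count."""
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    return 2 * device_count, 3 * device_count


def build_warmup_payload(model: str, prompt: str) -> dict:
    """Request body for one warm-up request with an output length of 1."""
    return {
        "model": model,
        "prompt": prompt,  # caller supplies a prompt of roughly 8K tokens
        "max_completion_tokens": 1,
        "temperature": 0.0,
    }


if __name__ == "__main__":
    # Example: an instance spanning 8 devices.
    print(warmup_request_counts(8))
```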

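Example of Using Memcache as the KV Pool Backend#

Install Memcache#

MemCache depends on MemFabric, so MemFabric must be installed first. After MemFabric is installed, install MemCache.

Configure the memcache Config File#

Configuration path: /usr/local/memcache_hybrid/latest/config/. Parameter description: https://gitcode.com/Ascend/memcache/blob/develop/doc/memcache_config.md

Configure the TLS certificates. If TLS is disabled, no certificates need to be uploaded. If TLS is enabled, the certificates must be uploaded.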

# mmc-meta.conf
ock.mmc.tls.enable = false
ock.mmc.config_store.tls.enable = false

# mmc-local.conf
ock.mmc.tls.enable = false
ock.mmc.config_store.tls.enable = false
ock.mmc.local_service.hcom.tls.enable = false

It is recommended to copy mmc-local.conf and mmc-meta.conf to your own path and modify them there, then set the MMC_META_CONFIG_PATH environment variable to the path of your own mmc-meta.conf file.

mmc-meta.conf:

# Meta service start-up url
# It is automatically changed to the PodIP at Pod startup in the K8s meta service cluster master-standby high availability scenario
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
# Config store url. It is automatically changed to the PodIP at Pod startup in K8s
ock.mmc.meta_service.config_store_url = tcp://xx.xx.xx.xx:6000
# Enable or disable high availability deployment
ock.mmc.meta.ha.enable = false
# Log level: debug, info, warn, error
ock.mmc.log_level = error
# Log directory path, supports both relative and absolute paths, the system will automatically append 'logs' directory.
# The absolute log path at default value is '/path/to/mmc_meta_service/../logs'
# If the path of mmc_meta_service is '/usr/local/mxc/memfabric_hybrid/latest/aarch64-linux/bin'
# Then the path of log is '/usr/local/mxc/memfabric_hybrid/latest/aarch64-linux/logs'
ock.mmc.log_path = .
# Log rotation file size, unit is MB, value range [1,500]
ock.mmc.log_rotation_file_size = 20
# Log rotation file count, value range [1,50]
ock.mmc.log_rotation_file_count = 50

# The threshold that triggers eviction, measured as a percentage of space usage
# 'put' operation will trigger eviction when the threshold is exceeded
ock.mmc.evict_threshold_high = 90
# The target threshold of eviction, measured as a percentage of space usage
ock.mmc.evict_threshold_low = 80

# TLS configuration for metaservice
ock.mmc.tls.enable = false
ock.mmc.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.tls.cert.path = /opt/ock/security/certs/server.cert.pem
ock.mmc.tls.key.path = /opt/ock/security/certs/server.private.key.pem
ock.mmc.tls.key.pass.path = /opt/ock/security/certs/server.passphrase
ock.mmc.tls.package.path = /opt/ock/security/libs/
ock.mmc.tls.decrypter.path =

# TLS configuration for config store
ock.mmc.config_store.tls.enable = false
ock.mmc.config_store.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.config_store.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.config_store.tls.cert.path = /opt/ock/security/certs/server.cert.pem
ock.mmc.config_store.tls.key.path = /opt/ock/security/certs/server.private.key.pem
ock.mmc.config_store.tls.key.pass.path = /opt/ock/security/certs/server.passphrase
ock.mmc.config_store.tls.package.path = /opt/ock/security/libs/
ock.mmc.config_store.tls.decrypter.path =

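Key points:

| Parameter | Description |
| --- | --- |
| ock.mmc.meta_service_url | IP address and port of the master node. The P node and the D node can use the same IP address and port. |
| ock.mmc.meta_service.config_store_url | IP address and port of the master node. The P node and the D node can use the same IP address and port. |
| ock.mmc.meta.ha.enable | Enables or disables high availability deployment. Set to false for a non-HA deployment. |
| ock.mmc.config_store.tls.enable | Set to false to disable TLS authentication. |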

mmc-local.conf:

# Meta service start-up url
# K8s meta service cluster master-standby high availability scenario: ClusterIP address
# Non-HA scenario: keep consistent with the same name configuration in mmc-meta.conf
ock.mmc.meta_service_url = tcp://xx.xx.xx.xx:5000
# Log level: debug, info, warn, error
ock.mmc.log_level = error

# TLS configurations for metaservice
ock.mmc.tls.enable = false
ock.mmc.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.tls.cert.path = /opt/ock/security/certs/client.cert.pem
ock.mmc.tls.key.path = /opt/ock/security/certs/client.private.key.pem
ock.mmc.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
ock.mmc.tls.package.path = /opt/ock/security/libs/
ock.mmc.tls.decrypter.path =

# Total count of local services, including services that will be added in the future
ock.mmc.local_service.world_size = 256
# Config store url. It is automatically changed to the PodIP at Pod startup in the HA scenario
# keep consistent with the same name configuration in mmc-meta.conf
ock.mmc.local_service.config_store_url = tcp://xx.xx.xx.xx:6000
# TLS configurations for config_store
ock.mmc.config_store.tls.enable = false
ock.mmc.config_store.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.config_store.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.config_store.tls.cert.path = /opt/ock/security/certs/client.cert.pem
ock.mmc.config_store.tls.key.path = /opt/ock/security/certs/client.private.key.pem
ock.mmc.config_store.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
ock.mmc.config_store.tls.package.path = /opt/ock/security/libs/
ock.mmc.config_store.tls.decrypter.path =

# Data transfer protocol, 'host_rdma': rdma over host; 'host_tcp': tcp over host; 'device_rdma': rdma over device; 'device_sdma': sdma over device
ock.mmc.local_service.protocol = device_sdma
# HBM/DRAM space usage, configuration type supports 134217728, 2048KB/2048K, 200MB/200mb/200m, 2.5GB or 1TB, case-insensitive, the maximum value is 1TB
# The system automatically calculates and aligns downwards to 2MB (host_rdma or host_tcp) or 1GB (device_sdma or device_rdma)
# After alignment, the HBM size and DRAM size cannot both be 0 at the same time
ock.mmc.local_service.dram.size = 2GB
ock.mmc.local_service.hbm.size = 0

# If the protocol is host_rdma, the ip needs to be set as RDMA network card ip. Use 'show_gids' command to query it
ock.mmc.local_service.hcom_url = tcp://127.0.0.1:7000
# HCOM TLS config
ock.mmc.local_service.hcom.tls.enable = false
ock.mmc.local_service.hcom.tls.ca.path = /opt/ock/security/certs/ca.cert.pem
ock.mmc.local_service.hcom.tls.ca.crl.path = /opt/ock/security/certs/ca.crl.pem
ock.mmc.local_service.hcom.tls.cert.path = /opt/ock/security/certs/client.cert.pem
ock.mmc.local_service.hcom.tls.key.path = /opt/ock/security/certs/client.private.key.pem
ock.mmc.local_service.hcom.tls.key.pass.path = /opt/ock/security/certs/client.passphrase
ock.mmc.local_service.hcom.tls.decrypter.path =

# The total retry duration (retry interval is 200ms) when client requests meta service and the connection does not exist
# Default value is 0, means no-retry and return immediately, value range [0, 600000]
ock.mmc.client.retry_milliseconds = 0

ock.mmc.client.timeout.seconds = 60

# read/write thread pool size, value range [1, 64]
ock.mmc.client.read_thread_pool.size = 16
ock.mmc.client.write_thread_pool.size = 2

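Key points:

| Parameter | Description |
| --- | --- |
| ock.mmc.meta_service_url | IP address and port of the master node. The P node and the D node can use the same IP address and port. |
| ock.mmc.local_service.config_store_url | IP address and port of the master node. The P node and the D node can use the same IP address and port. |
| ock.mmc.local_service.world_size | Total number of local services, including services that will be added in the future. |
| ock.mmc.local_service.protocol | host_rdma (default); device_rdma (supported on A2 and A3 when device RoCE is available, recommended for A2); device_sdma (supported on A3 when HCCS is available, recommended for A3). Heterogeneous protocol settings are not currently supported. |
| ock.mmc.local_service.dram.size | Memory size occupied by the master node. The configured value is the memory size occupied by each card. |
| ock.mmc.meta.ha.enable | Enables or disables high availability deployment (configured in mmc-meta.conf). Set to false for a non-HA deployment. |
| ock.mmc.config_store.tls.enable | Set to false to disable TLS authentication. |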
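The downward alignment described in the mmc-local.conf comments (2MB for host protocols, 1GB for device protocols) can be previewed with a short Python sketch. The function names are hypothetical. The unit multipliers are assumed to be binary, and host_rdma is assumed to share the 2MB alignment of host_tcp. Neither assumption is confirmed by this guide.

```python
import re

MB = 1024 ** 2
GB = 1024 ** 3
# Assumed binary multipliers for the size spellings listed in the config comment.
UNITS = {"": 1, "B": 1, "K": 1024, "KB": 1024, "M": MB, "MB": MB,
         "G": GB, "GB": GB, "T": 1024 * GB, "TB": 1024 * GB}
# Alignment step per protocol (host_rdma treated like host_tcp: an assumption).
ALIGNMENT = {"host_rdma": 2 * MB, "host_tcp": 2 * MB,
             "device_rdma": GB, "device_sdma": GB}


def parse_size(text: str) -> int:
    """Convert strings such as '2048K', '200mb' or '2.5GB' to bytes (case-insensitive)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]*)", text.strip())
    if match is None or match.group(2).upper() not in UNITS:
        raise ValueError(f"unsupported size string: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2).upper()])


def aligned_size(text: str, protocol: str) -> int:
    """Size in bytes after aligning downwards to the step of the given protocol."""
    step = ALIGNMENT[protocol]
    return parse_size(text) // step * step


if __name__ == "__main__":
    # 2.5GB with a device protocol is cut down to a multiple of 1GB.
    print(aligned_size("2.5GB", "device_sdma") // GB)
```

Under these assumptions, a size below 1GB aligns to 0 with a device protocol, which matters because the config comment states that the HBM and DRAM sizes cannot both be 0 after alignment.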

Memcache Environment Variables#

source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/memfabric_hybrid/set_env.sh
# Point MMC_META_CONFIG_PATH at the mmc-meta.conf configuration file
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf

Run the Memcache Master Node#

Method 1: start the MetaService service from Python.

# 1. Set the environment variable that points to the configuration file.
export MMC_META_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-meta.conf

# 2. Open a Python console, or save and run the following Python script, to start the process:
from memcache_hybrid import MetaService
MetaService.main()

Method 2: start the MetaService service from the mmc_meta_service binary.

source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/memfabric_hybrid/set_env.sh
export MMC_META_CONFIG_PATH=/home/memcache/shell/mmc-meta.conf # Set it to the path of your own configuration file.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/python3.11.10/lib/
/usr/local/memcache_hybrid/latest/aarch64-linux/bin/mmc_meta_service

PD Disaggregation Scenario#

1. Run the prefill node and decode node#

Use MultiConnector to combine MooncakeConnectorV1 and AscendStoreConnector. MooncakeConnectorV1 performs kv_transfer, while AscendStoreConnector enables the KV cache pool.

800I A2/800T A2 Series#

Prefill node:

rm -rf /root/ascend/log/*

source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf

# nic_name can be looked up in ifconfig
nic_name="xxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1

rm -rf ./connector.log
vllm serve xxxxxxx/Qwen3-32B \
  --host 0.0.0.0 \
  --port 30050 \
  --enforce-eager \
  --data-parallel-size 2 \
  --tensor-parallel-size 4 \
  --seed 1024 \
  --served-model-name qwen3 \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 20 \
  --no-enable-prefix-caching \
  --kv-transfer-config \
    '{
            "kv_connector": "MultiConnector",
            "kv_role": "kv_producer",
            "engine_id": "2",
            "kv_connector_extra_config": {
                "connectors": [
                {
                            "kv_connector": "MooncakeConnectorV1",
                            "kv_role": "kv_producer",
                            "kv_port": "20001",
                            "kv_connector_extra_config": {
                                    "prefill": {
                                            "dp_size": 2,
                                            "tp_size": 4
                                    },
                                    "decode": {
                                            "dp_size": 2,
                                            "tp_size": 4
                                    }
                            }
                    },
                    {
                            "kv_connector": "AscendStoreConnector",
                            "kv_role": "kv_producer",
                            "kv_connector_extra_config":{
                                    "backend": "memcache",
                                    "lookup_rpc_port":"0"
                            }
                    }  
                ]
            }
    }' > log_p.log 2>&1

Decode node:

rm -rf /root/ascend/log/*

source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf

# nic_name can be looked up in ifconfig
nic_name="xxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1

rm -rf ./connector.log
vllm serve xxxxxxx/Qwen3-32B \
  --host 0.0.0.0 \
  --port 30060 \
  --enforce-eager \
  --data-parallel-size 2 \
  --tensor-parallel-size 4 \
  --seed 1024 \
  --served-model-name qwen3 \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 20 \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_consumer",
        "kv_connector_extra_config": {
                "connectors": [
                {
                                "kv_connector": "MooncakeConnectorV1",
                                "kv_role": "kv_consumer",
                                "kv_port": "20002",
                                "kv_connector_extra_config": {
                                        "prefill": {
                                                "dp_size": 2,
                                                "tp_size": 4
                                        },
                                        "decode": {
                                                "dp_size": 2,
                                                "tp_size": 4
                                        }
                                }
                    } ,
            {  
                               "kv_connector": "AscendStoreConnector",
                               "kv_role": "kv_consumer",
                               "kv_connector_extra_config":{
                                    "backend": "memcache",
                                    "lookup_rpc_port":"1"
                               }
                       }  

                ]
        }
  }' > log_d.log 2>&1

800I A3/800T A3 Series#

Prefill node:

rm -rf /root/ascend/log/*

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf

export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ACL_OP_INIT_MODE=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024


python -m vllm.entrypoints.openai.api_server \
  --model=xxxxxxxxx/DeepSeek-R1 \
  --served-model-name dsv3 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --port 30050 \
  --max-num-seqs 20 \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --enable_expert_parallel \
  --quantization ascend \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --kv-transfer-config \
 '{
  "kv_connector": "MultiConnector",
  "kv_role": "kv_producer",
  "engine_id": "2",
  "kv_connector_extra_config": {
   "connectors": [
   {
     "kv_connector": "MooncakeConnectorV1",
     "kv_role": "kv_producer",
     "kv_port": "20001",
     "kv_connector_extra_config": {
      "prefill": {
       "dp_size": 2,
       "tp_size": 8
      },
      "decode": {
       "dp_size": 2,
       "tp_size": 8
      }
     }
    },
    {
     "kv_connector": "AscendStoreConnector",
     "kv_role": "kv_producer",
     "kv_connector_extra_config":{
      "backend": "memcache",
      "lookup_rpc_port":"0"
     }
    }  
   ]
  }
 }' > log_p.log 2>&1 

Decode node:

rm -rf /root/ascend/log/*

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf

export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ACL_OP_INIT_MODE=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024

python -m vllm.entrypoints.openai.api_server \
  --model=xxxxxxxxxxxxxxxx/DeepSeek \
  --served-model-name dsv3 \
  --trust-remote-code \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --port 30060 \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --enforce-eager \
  --quantization ascend \
  --no-enable-prefix-caching \
  --max-num-seqs 20 \
  --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
  --enable_expert_parallel \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{
 "kv_connector": "MultiConnector",
 "kv_role": "kv_consumer",
 "kv_connector_extra_config": {
  "connectors": [
  {
    "kv_connector": "MooncakeConnectorV1",
    "kv_role": "kv_consumer",
    "kv_port": "20002",
    "kv_connector_extra_config": {
     "prefill": {
      "dp_size": 2,
      "tp_size": 8
     },
     "decode": {
      "dp_size": 2,
      "tp_size": 8
     }
    }
   },
    {
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_consumer",
    "kv_connector_extra_config":{
                "backend": "memcache",
                "lookup_rpc_port":"1"
    }
   }  
  ]
 }
  }' > log_d.log 2>&1

2. Start proxy_server#

3. Run inference#

PD-Mixed Scenario#

1. Run the mixed deployment script#

800I A2/800T A2 Series#

The DeepSeek model needs to run on a two-node cluster.

Run_pd_mix_1.sh:

rm -rf /root/ascend/log/*

source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf

# nic_name can be looked up in ifconfig
nic_name="xxxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name


export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1

rm -rf ./connector.log
vllm serve xxxxxxx/DeepSeek-R1 \
  --host 0.0.0.0 \
  --port 30050 \
  --enforce-eager \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --api-server-count 2 \
  --data-parallel-address 141.61.33.167 \
  --data-parallel-rpc-port 13348  \
  --tensor-parallel-size 8 \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --quantization ascend \
  --max-num-seqs 20 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{
        "kv_connector": "AscendStoreConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
                "backend": "memcache",
                "lookup_rpc_port":"0"
           }
  }' > log_pd_mix_1.log 2>&1

Run_pd_mix_2.sh:

rm -rf /root/ascend/log/*

source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/mmc-local.conf

# nic_name can be looked up in ifconfig
nic_name="xxxxxxx"
local_ip="xx.xx.xx.xx"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
# export VLLM_TORCH_PROFILER_DIR="./vllm-profiling"
# export VLLM_TORCH_PROFILER_WITH_STACK=0

rm -rf ./connector.log
vllm serve xxxxxxx/DeepSeek-R1 \
  --host 0.0.0.0 \
  --port 30050 \
  --headless  \
  --enforce-eager \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 \
  --data-parallel-address 141.61.33.167 \
  --data-parallel-rpc-port 13348  \
  --tensor-parallel-size 8 \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --quantization ascend \
  --max-num-seqs 20 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --kv-transfer-config \
   '{
        "kv_connector": "AscendStoreConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
                "backend": "memcache",
                "lookup_rpc_port":"0"
           }
  }' > log_pd_mix_2.log 2>&1

800I A3/800T A3 Series#

bash pd_mix.sh

Contents of pd_mix.sh:

rm -rf /root/ascend/log/*

# memcache:
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/home/memcache/shell/mmc-local.conf

export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ACL_OP_INIT_MODE=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

export PYTHONHASHSEED=0
export HCCL_BUFFSIZE=1024


python -m vllm.entrypoints.openai.api_server \
  --model=xxxxxxx/DeepSeek-R1 \
  --served-model-name dsv3 \
  --trust-remote-code \
  --enforce-eager \
  -dp 2 \
  -tp 8 \
  --port 30050 \
  --max-num-seqs 20 \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
  --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --enable_expert_parallel \
  --quantization ascend \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{
      "kv_connector": "AscendStoreConnector",
      "kv_role": "kv_both",
      "kv_connector_extra_config": {
        "backend": "memcache",
        "lookup_rpc_port":"0"
      }
  }' > log_pd_mix.log 2>&1