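Using Volcano Kthena
This guide shows how to run prefill-decode (PD) disaggregated inference on Huawei Ascend NPUs with vLLM-Ascend, orchestrated on Kubernetes by Kthena. For Kthena's vLLM support in general, see the "Deploying vLLM with Kthena" guide.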
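1. What Is Prefill-Decode Disaggregation?
Large language model inference naturally splits into two phases:
Prefill
Processes the input tokens and builds the key-value (KV) cache.
Batch-friendly and high-throughput, well suited to parallel execution on NPUs.
Decode
Consumes the KV cache to generate output tokens.
Latency-sensitive, memory-intensive, and more sequential.
PD disaggregation runs the two phases in separate workers and transfers the KV cache from the prefill side to the decode side. From the client's perspective, the service still appears as a single chat/completions endpoint.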
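The split described above can be illustrated with a toy sketch. It is not how vLLM-Ascend implements either phase: `KVCache`, `prefill`, and `decode` are hypothetical names, and the "tokens" are placeholder strings. It only shows that prefill handles the whole prompt in one batch, decode proceeds one token per step, and the cache is the state passed between them.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for a per-request key-value cache (hypothetical type)."""
    entries: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Prefill side: process the whole prompt in one batch and build the cache."""
    return KVCache([(f"k:{tok}", f"v:{tok}") for tok in prompt_tokens])


def decode(cache, max_new_tokens):
    """Decode side: emit one token per step, extending the cache each time."""
    generated = []
    for _ in range(max_new_tokens):
        token = f"tok{len(cache.entries)}"  # placeholder for real sampling
        cache.entries.append((f"k:{token}", f"v:{token}"))
        generated.append(token)
    return generated


# The cache is the only state handed from the prefill side to the decode side.
cache = prefill(["What", "is", "the", "capital", "?"])
reply = decode(cache, max_new_tokens=3)
print(reply)
```

In a real deployment the hand-off is a KV transfer between separate Pods rather than a Python object passed in memory.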
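2. Deploying on Kubernetes with Kthena
Kthena is a Kubernetes-native LLM inference platform for deploying and managing large language models in production. It is built on declarative model lifecycle management and intelligent request routing, and targets high performance and enterprise-grade scalability for LLM inference workloads. This example uses three key Custom Resource Definitions (CRDs):
ModelServing: defines the workloads (the prefill and decode roles).
ModelServer: manages PD grouping and internal routing.
ModelRoute: exposes a stable model endpoint.
This section uses deepseek-ai/DeepSeek-V2-Lite as the example, but you can substitute any model supported by vLLM-Ascend.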
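2.1 Prerequisites
A Kubernetes cluster with Ascend NPU nodes.
The NPU resource name may differ slightly depending on the NPU driver and device plugin in use. For example:
With MindCluster, use huawei.com/Ascend310P or huawei.com/Ascend910.
On Huawei Cloud CCE (Cloud Container Engine) with the CCE AI Suite add-on (Ascend NPU) installed, use huawei.com/ascend-310 or huawei.com/ascend-1980.
The manifest in section 2.2 requests huawei.com/ascend-1980; change the resource name there to match your cluster.
Kthena installed. Follow the Kthena installation guide.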
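To find out which NPU resource name your nodes actually advertise, you can inspect each node's allocatable resources. The helper below is a minimal sketch: `npu_resources` is a hypothetical name, it assumes the NPU resources share the `huawei.com/` prefix listed above, and the node name in the sample is made up. It expects the JSON printed by `kubectl get nodes -o json`.

```python
import json


def npu_resources(nodes_json, prefix="huawei.com/"):
    """Map node name -> {resource name: quantity} for resources with the prefix."""
    found = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        npus = {name: qty for name, qty in allocatable.items()
                if name.startswith(prefix)}
        if npus:
            found[node["metadata"]["name"]] = npus
    return found


# Hand-written sample in the shape of `kubectl get nodes -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "npu-node-1"},  # hypothetical node name
         "status": {"allocatable": {"cpu": "192", "huawei.com/ascend-1980": "8"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "32"}}},
    ]
})
print(npu_resources(sample))
```

Against a live cluster, pipe the real output in, for example `kubectl get nodes -o json > nodes.json` and then pass the file contents to `npu_resources`.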
2.2 Deploying Prefill-Decode Disaggregated DeepSeek-V2-Lite on Kubernetes
Kthena provides a concrete example in the volcano-sh/kthena repository.
Deploy it with:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/prefill-decode-disaggregation.yaml
Or:
cat << 'EOF' | kubectl apply -f -
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: deepseek-v2-lite
  namespace: dev
spec:
  schedulerName: volcano
  replicas: 1
  recoveryPolicy: ServingGroupRecreate
  template:
    restartGracePeriodSeconds: 60
    roles:
      - name: prefill
        replicas: 1
        entryTemplate:
          spec:
            initContainers:
              - name: downloader
                imagePullPolicy: Always
                image: ghcr.io/volcano-sh/downloader:latest
                args:
                  - --source
                  - deepseek-ai/DeepSeek-V2-Lite
                  - --output-dir
                  - /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
            containers:
              - name: runtime
                image: ghcr.io/volcano-sh/runtime:latest
                ports:
                  - containerPort: 8100
                args:
                  - --port
                  - "8100"
                  - --engine
                  - vllm
                  - --pod
                  - $(POD_NAME).$(NAMESPACE)
                  - --model
                  - deepseek-v2-lite
                  - --engine-base-url
                  - http://localhost:8000
              - name: vllm
                image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
                ports:
                  - containerPort: 8000
                env:
                  - name: HF_HUB_OFFLINE
                    value: "1"
                  - name: HCCL_IF_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: GLOO_SOCKET_IFNAME
                    value: eth0
                  - name: TP_SOCKET_IFNAME
                    value: eth0
                  - name: HCCL_SOCKET_IFNAME
                    value: eth0
                  - name: VLLM_LOGGING_LEVEL
                    value: DEBUG
                  - name: AscendRealDevices
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/AscendReal']
                args:
                  - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
                  - "--served-model-name"
                  - "deepseek-ai/DeepSeekV2"
                  - "--tensor-parallel-size"
                  - "2"
                  - "--gpu-memory-utilization"
                  - "0.8"
                  - "--max-model-len"
                  - "8192"
                  - "--max-num-batched-tokens"
                  - "8192"
                  - "--trust-remote-code"
                  - "--enforce-eager"
                  - "--kv-transfer-config"
                  - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_producer","kv_parallel_size":1,"kv_port":"20001","engine_id":"0","kv_rank":0,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                  requests:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                readinessProbe:
                  initialDelaySeconds: 5
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                livenessProbe:
                  initialDelaySeconds: 900
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                    readOnly: true
                  - name: hccn-config
                    mountPath: /etc/hccn.conf
                    readOnly: true
                  - name: shared-memory-volume
                    mountPath: /dev/shm
            volumes:
              - name: models
                hostPath:
                  path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                  type: DirectoryOrCreate
              - name: hccn-config
                hostPath:
                  path: /etc/hccn.conf
                  type: File
              - name: shared-memory-volume
                emptyDir:
                  sizeLimit: 256Mi
                  medium: Memory
      - name: decode
        replicas: 1
        entryTemplate:
          spec:
            initContainers:
              - name: downloader
                imagePullPolicy: Always
                image: ghcr.io/volcano-sh/downloader:latest
                args:
                  - --source
                  - deepseek-ai/DeepSeek-V2-Lite
                  - --output-dir
                  - /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
            containers:
              - name: vllm
                image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
                ports:
                  - containerPort: 8000
                env:
                  - name: HF_HUB_OFFLINE
                    value: "1"
                  - name: HCCL_IF_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: GLOO_SOCKET_IFNAME
                    value: eth0
                  - name: TP_SOCKET_IFNAME
                    value: eth0
                  - name: HCCL_SOCKET_IFNAME
                    value: eth0
                  - name: VLLM_LOGGING_LEVEL
                    value: DEBUG
                  - name: AscendRealDevices
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/AscendReal']
                args:
                  - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
                  - "--served-model-name"
                  - "deepseek-ai/DeepSeekV2"
                  - "--tensor-parallel-size"
                  - "2"
                  - "--gpu-memory-utilization"
                  - "0.8"
                  - "--max-model-len"
                  - "8192"
                  - "--max-num-batched-tokens"
                  - "16384"
                  - "--trust-remote-code"
                  - "--no-enable-prefix-caching"
                  - "--enforce-eager"
                  - "--kv-transfer-config"
                  - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_consumer","kv_parallel_size":1,"kv_port":"20002","engine_id":"1","kv_rank":1,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                  requests:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                readinessProbe:
                  initialDelaySeconds: 5
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                livenessProbe:
                  initialDelaySeconds: 900
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                    readOnly: true
                  - name: hccn-config
                    mountPath: /etc/hccn.conf
                    readOnly: true
                  - name: shared-memory-volume
                    mountPath: /dev/shm
            volumes:
              - name: models
                hostPath:
                  path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                  type: DirectoryOrCreate
              - name: hccn-config
                hostPath:
                  path: /etc/hccn.conf
                  type: File
              - name: shared-memory-volume
                emptyDir:
                  sizeLimit: 256Mi
                  medium: Memory
EOF
You should see Pods such as:
deepseek-v2-lite-0-prefill-0-0
deepseek-v2-lite-0-decode-0-0
To make the model reachable, the routing layer still has to be configured through a ModelServer and a ModelRoute.
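The two `--kv-transfer-config` values in the ModelServing manifest are long single-line JSON strings that differ in only a few fields. The sketch below regenerates both from one shared dictionary so the differences are easy to see. All field values are copied from the manifest; `kv_transfer_config` is a hypothetical helper name, not part of vLLM or Kthena.

```python
import json

# Fields shared by both roles, copied from the manifest.
COMMON = {
    "kv_connector": "MooncakeConnectorV1",
    "kv_buffer_device": "npu",
    "kv_parallel_size": 1,
    "kv_connector_extra_config": {
        "prefill": {"dp_size": 2, "tp_size": 2},
        "decode": {"dp_size": 2, "tp_size": 2},
    },
}

# Fields that differ: prefill produces the KV cache, decode consumes it.
PER_ROLE = {
    "prefill": {"kv_role": "kv_producer", "kv_port": "20001",
                "engine_id": "0", "kv_rank": 0},
    "decode": {"kv_role": "kv_consumer", "kv_port": "20002",
               "engine_id": "1", "kv_rank": 1},
}


def kv_transfer_config(role):
    """Return the compact JSON string for the given role ('prefill' or 'decode')."""
    return json.dumps({**COMMON, **PER_ROLE[role]}, separators=(",", ":"))


prefill_cfg = json.loads(kv_transfer_config("prefill"))
decode_cfg = json.loads(kv_transfer_config("decode"))
differing = sorted(k for k in prefill_cfg if prefill_cfg[k] != decode_cfg[k])
print(differing)
```

Generating the strings this way is only a convenience for keeping the two roles in sync when you edit the manifest; the key order in the output differs from the manifest, which does not matter for JSON.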
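2.3 ModelServer: PD Group Management
The ModelServer resource:
Selects the ModelServing workload by label.
Groups prefill and decode Pods into PD pairs.
Configures the KV connector details and the timeout.
Exposes an internal gRPC/HTTP interface.
Create the ModelServer with: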
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/kthena-router/ModelServer-prefill-decode-disaggregation.yaml
Or:
cat << EOF | kubectl apply -f -
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelServer
metadata:
  name: deepseek-v2
  namespace: dev
spec:
  kvConnector:
    type: nixl
  workloadSelector:
    matchLabels:
      modelserving.volcano.sh/name: deepseek-v2-lite
    pdGroup:
      groupKey: "modelserving.volcano.sh/group-name"
      prefillLabels:
        modelserving.volcano.sh/role: prefill
      decodeLabels:
        modelserving.volcano.sh/role: decode
  workloadPort:
    port: 8000
  model: "deepseek-ai/DeepSeekV2"
  inferenceEngine: "vLLM"
  trafficPolicy:
    timeout: 10s
EOF
2.4 ModelRoute: User-Facing Endpoint
The ModelRoute resource maps a model name (for example, "deepseek-ai/DeepSeekV2") to a ModelServer.
Example manifest:
cat << EOF | kubectl apply -f -
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: deepseek-v2
  namespace: dev
spec:
  modelName: "deepseek-ai/DeepSeekV2"
  rules:
    - name: "default"
      targetModels:
        - modelServerName: "deepseek-v2"
EOF
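The three resources refer to one another by name and label, so a typo in any of them can break routing. The sketch below checks those cross-references on plain dictionaries that mirror the manifests in this guide. It assumes, as the manifests here suggest but this guide does not state as a rule, that `--served-model-name`, `ModelServer.spec.model`, and `ModelRoute.spec.modelName` are meant to match. `served_model_name` and `check_wiring` are hypothetical helper names.

```python
def served_model_name(vllm_args):
    """Return the value that follows --served-model-name in a vLLM argument list."""
    return vllm_args[vllm_args.index("--served-model-name") + 1]


def check_wiring(serving_name, vllm_args, server, route):
    """Return a list of mismatches between the resources; empty means consistent."""
    problems = []
    spec = server["spec"]
    label = spec["workloadSelector"]["matchLabels"].get("modelserving.volcano.sh/name")
    if label != serving_name:
        problems.append(f"ModelServer selects {label!r}, ModelServing is {serving_name!r}")
    if spec["model"] != served_model_name(vllm_args):
        problems.append("ModelServer.spec.model differs from --served-model-name")
    if route["spec"]["modelName"] != spec["model"]:
        problems.append("ModelRoute.spec.modelName differs from ModelServer.spec.model")
    targets = [t["modelServerName"]
               for rule in route["spec"]["rules"] for t in rule["targetModels"]]
    if server["metadata"]["name"] not in targets:
        problems.append("no ModelRoute rule targets this ModelServer")
    return problems


# Values copied from the manifests in sections 2.2 to 2.4.
args = ["--served-model-name", "deepseek-ai/DeepSeekV2", "--tensor-parallel-size", "2"]
server = {
    "metadata": {"name": "deepseek-v2"},
    "spec": {
        "workloadSelector": {
            "matchLabels": {"modelserving.volcano.sh/name": "deepseek-v2-lite"}},
        "model": "deepseek-ai/DeepSeekV2",
    },
}
route = {"spec": {"modelName": "deepseek-ai/DeepSeekV2",
                  "rules": [{"name": "default",
                             "targetModels": [{"modelServerName": "deepseek-v2"}]}]}}
print(check_wiring("deepseek-v2-lite", args, server, route))
```

An empty list means the names line up; each string in a non-empty list points at one reference to fix.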
3. Verification
3.1 Checking the Workloads
Confirm that the prefill and decode Pods are up:
kubectl get modelserving deepseek-v2-lite -n dev -o yaml | grep status -A 10
kubectl get pod -n dev -owide \
-l modelserving.volcano.sh/name=deepseek-v2-lite
Both roles should be in the Running state and report Ready.
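If you want to script this check rather than read the table by eye, the Pod list can be evaluated programmatically. The sketch below is one way to do it, assuming the Pods carry the `modelserving.volcano.sh/role` label that the ModelServer uses to tell prefill from decode; `is_ready` and `ready_roles` are hypothetical helper names. It expects the JSON printed by the `kubectl get pod` command above with `-o json` in place of `-owide`.

```python
import json

ROLE_LABEL = "modelserving.volcano.sh/role"


def is_ready(pod):
    """True when the Pod phase is Running and its Ready condition is "True"."""
    status = pod.get("status", {})
    ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                for c in status.get("conditions", []))
    return status.get("phase") == "Running" and ready


def ready_roles(pods_json):
    """Map each role label value to True only if every Pod of that role is ready."""
    roles = {}
    for pod in json.loads(pods_json).get("items", []):
        role = pod["metadata"].get("labels", {}).get(ROLE_LABEL, "unknown")
        roles.setdefault(role, []).append(is_ready(pod))
    return {role: all(flags) for role, flags in roles.items()}


# Hand-written sample in the shape of `kubectl get pod -o json` output.
sample_pods = json.dumps({"items": [
    {"metadata": {"name": "deepseek-v2-lite-0-prefill-0-0",
                  "labels": {ROLE_LABEL: "prefill"}},
     "status": {"phase": "Running",
                "conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "deepseek-v2-lite-0-decode-0-0",
                  "labels": {ROLE_LABEL: "decode"}},
     "status": {"phase": "Running",
                "conditions": [{"type": "Ready", "status": "False"}]}},
]})
print(ready_roles(sample_pods))
```

In the sample the decode Pod is Running but not yet Ready, which is what you would typically see while the model is still loading.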
3.2 Testing the Chat Endpoint
Once routing is configured, you can send a test request to the Kthena router:
export ENDPOINT=$(kubectl get svc kthena-router -n kthena-system --output=jsonpath='{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}')
curl --location "http://${ENDPOINT}/v1/chat/completions" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeekV2",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of China?"
      }
    ],
    "stream": false
  }'
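A successful JSON response confirms that:
The prefill and decode services are both running on Ascend NPUs.
KV transfer between them works.
The Kthena routing layer correctly fronts the vLLM-Ascend backend.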
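The same request can be sent from Python using only the standard library. This is a minimal sketch: `build_chat_request`, `extract_reply`, and `chat` are hypothetical helper names, and the response parsing assumes the router returns the OpenAI-style chat completion body (`choices[0].message.content`) that vLLM's server produces.

```python
import json
import urllib.request


def build_chat_request(endpoint, model, prompt):
    """Build the POST request for /v1/chat/completions; endpoint is 'host:port'."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return json.loads(body)["choices"][0]["message"]["content"]


def chat(endpoint, model, prompt, timeout=60):
    """Send the request and return the assistant's reply (needs a live router)."""
    request = build_chat_request(endpoint, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())
```

With the `ENDPOINT` value exported above, a call would look like `chat(os.environ["ENDPOINT"], "deepseek-ai/DeepSeekV2", "What is the capital of China?")` after `import os`.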
4. Cleanup
To remove the deployment:
# 1. Remove user-facing routing
kubectl delete modelroute deepseek-v2 -n dev
# 2. Remove internal server
kubectl delete modelserver deepseek-v2 -n dev
# 3. Remove workloads
kubectl delete modelserving deepseek-v2-lite -n dev
5. Summary
For more advanced features, see the Kthena website.