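Using Volcano Kthena

This guide shows how to run prefill-decode (PD) disaggregated inference on Huawei Ascend NPUs with vLLM-Ascend, orchestrated on Kubernetes by Kthena. For details on Kthena's support for vLLM, see Deploying vLLM with Kthena.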


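1. What is prefill-decode disaggregation?

Large language model inference naturally splits into two phases:

  • Prefill

    • Processes the input tokens and builds the key-value (KV) cache.

    • Batch-friendly and high-throughput; well suited to parallel execution on NPUs.

  • Decode

    • Consumes the KV cache to generate output tokens.

    • Latency-sensitive, memory-intensive, and more sequential.

PD disaggregation runs the two phases in separate instances and passes the KV cache from prefill to decode. From the client's point of view, the service still appears as a single chat/completions endpoint.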


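2. Deploying on Kubernetes with Kthena

Kthena is a Kubernetes-native LLM inference platform for deploying and managing large language models in production. It is built on declarative model lifecycle management and intelligent request routing, and targets high performance and enterprise-grade scalability for LLM inference workloads. This example uses three key custom resource definitions (CRDs):

  • ModelServing: defines the workloads (the prefill and decode roles).

  • ModelServer: manages PD grouping and internal routing.

  • ModelRoute: exposes a stable model endpoint.

This section uses deepseek-ai/DeepSeek-V2-Lite as the example, but you can substitute any model supported by vLLM-Ascend.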

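2.1 Prerequisites

  • A Kubernetes cluster with Ascend NPU nodes:

    The NPU resource name can differ slightly depending on the NPU driver in use. For example:

    • With MindCluster, use huawei.com/Ascend310P or huawei.com/Ascend910.

    • On Huawei Cloud CCE (Cloud Container Engine) with the CCE AI Suite add-on (Ascend NPU) installed, use huawei.com/ascend-310 or huawei.com/ascend-1980.

    The manifest in section 2.2 requests huawei.com/ascend-1980; change it to match your cluster.

  • Kthena is installed. Follow the Kthena installation guide.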
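Before deploying, it can help to confirm which Ascend resource name the cluster actually advertises and whether the Kthena CRDs are present. The following is a minimal sketch, not part of Kthena: the function names are hypothetical, and the `serving.volcano.sh` pattern is assumed from the API groups that appear in the manifests in this guide.

```shell
# Hypothetical helpers for checking the prerequisites (not shipped with Kthena).

# Print each distinct line of `kubectl describe nodes` that mentions huawei.com/.
# This covers capacity and allocatable entries such as huawei.com/ascend-1980,
# and may also include labels or annotations that use the same prefix.
list_ascend_resources() {
  kubectl describe nodes | grep -E 'huawei\.com/' | sort -u
}

# List installed CRDs whose name contains the serving.volcano.sh API group
# (assumed from the apiVersion fields used in this guide).
list_kthena_crds() {
  kubectl get crd -o name | grep 'serving\.volcano\.sh'
}
```

Running `list_ascend_resources` and `list_kthena_crds` against the target cluster shows the resource name to put under `resources.limits` and `resources.requests`, and whether the installation step still needs to be done.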

2.2 Deploying DeepSeek-V2-Lite with prefill-decode disaggregation on Kubernetes

Kthena provides a concrete example in the volcano-sh/kthena repository.

Deploy it with the following command:

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/model-serving/prefill-decode-disaggregation.yaml

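# The quoted 'EOF' keeps the shell from expanding $(POD_NAME) and $(NAMESPACE) inside the manifest.
cat << 'EOF' | kubectl apply -f -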
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: deepseek-v2-lite
  namespace: dev
spec:
  schedulerName: volcano
  replicas: 1
  recoveryPolicy: ServingGroupRecreate
  template:
    restartGracePeriodSeconds: 60
    roles:
      - name: prefill
        replicas: 1
        entryTemplate:
          spec:
            initContainers:
              - name: downloader
                imagePullPolicy: Always
                image: ghcr.io/volcano-sh/downloader:latest
                args:
                  - --source
                  - deepseek-ai/DeepSeek-V2-Lite
                  - --output-dir
                  - /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
            containers:
              - name: runtime
                image: ghcr.io/volcano-sh/runtime:latest
                ports:
                  - containerPort: 8100
                args:
                  - --port
                  - "8100"
                  - --engine
                  - vllm
                  - --pod
                  - $(POD_NAME).$(NAMESPACE)
                  - --model
                  - deepseek-v2-lite
                  - --engine-base-url
                  - http://localhost:8000
              - name: vllm
                image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
                ports:
                  - containerPort: 8000
                env:
                  - name: HF_HUB_OFFLINE
                    value: "1"
                  - name: HCCL_IF_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: GLOO_SOCKET_IFNAME
                    value: eth0
                  - name: TP_SOCKET_IFNAME
                    value: eth0
                  - name: HCCL_SOCKET_IFNAME
                    value: eth0
                  - name: VLLM_LOGGING_LEVEL
                    value: DEBUG
                  - name: AscendRealDevices
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/AscendReal']
                args:
                  - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
                  - "--served-model-name"
                  - "deepseek-ai/DeepSeekV2"
                  - "--tensor-parallel-size"
                  - "2"
                  - "--gpu-memory-utilization"
                  - "0.8"
                  - "--max-model-len"
                  - "8192"
                  - "--max-num-batched-tokens"
                  - "8192"
                  - "--trust-remote-code"
                  - "--enforce-eager"
                  - "--kv-transfer-config"
                  - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_producer","kv_parallel_size":1,"kv_port":"20001","engine_id":"0","kv_rank":0,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                  requests:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                readinessProbe:
                  initialDelaySeconds: 5
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                livenessProbe:
                  initialDelaySeconds: 900
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                    readOnly: true
                  - name: hccn-config
                    mountPath: /etc/hccn.conf
                    readOnly: true
                  - name: shared-memory-volume
                    mountPath: /dev/shm
            volumes:
              - name: models
                hostPath:
                  path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                  type: DirectoryOrCreate
              - name: hccn-config
                hostPath:
                  path: /etc/hccn.conf
                  type: File
              - name: shared-memory-volume
                emptyDir:
                  sizeLimit: 256Mi
                  medium: Memory
      - name: decode
        replicas: 1
        entryTemplate:
          spec:
            initContainers:
              - name: downloader
                imagePullPolicy: Always
                image: ghcr.io/volcano-sh/downloader:latest
                args:
                  - --source
                  - deepseek-ai/DeepSeek-V2-Lite
                  - --output-dir
                  - /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
            containers:
              - name: vllm
                image: ghcr.io/volcano-sh/kthena-engine:vllm-ascend_v0.10.1rc1_mooncake_v0.3.5
                ports:
                  - containerPort: 8000
                env:
                  - name: HF_HUB_OFFLINE
                    value: "1"
                  - name: HCCL_IF_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: GLOO_SOCKET_IFNAME
                    value: eth0
                  - name: TP_SOCKET_IFNAME
                    value: eth0
                  - name: HCCL_SOCKET_IFNAME
                    value: eth0
                  - name: VLLM_LOGGING_LEVEL
                    value: DEBUG
                  - name: AscendRealDevices
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/AscendReal']
                args:
                  - "/mnt/cache/deepseek-ai/DeepSeek-V2-Lite/"
                  - "--served-model-name"
                  - "deepseek-ai/DeepSeekV2"
                  - "--tensor-parallel-size"
                  - "2"
                  - "--gpu-memory-utilization"
                  - "0.8"
                  - "--max-model-len"
                  - "8192"
                  - "--max-num-batched-tokens"
                  - "16384"
                  - "--trust-remote-code"
                  - "--no-enable-prefix-caching"
                  - "--enforce-eager"
                  - "--kv-transfer-config"
                  - '{"kv_connector":"MooncakeConnectorV1","kv_buffer_device":"npu","kv_role":"kv_consumer","kv_parallel_size":1,"kv_port":"20002","engine_id":"1","kv_rank":1,"kv_connector_extra_config":{"prefill":{"dp_size":2,"tp_size":2},"decode":{"dp_size":2,"tp_size":2}}}'
                imagePullPolicy: Always
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                  requests:
                    cpu: "8"
                    memory: 64Gi
                    huawei.com/ascend-1980: "4"
                readinessProbe:
                  initialDelaySeconds: 5
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                livenessProbe:
                  initialDelaySeconds: 900
                  periodSeconds: 5
                  failureThreshold: 3
                  httpGet:
                    path: /health
                    port: 8000
                volumeMounts:
                  - name: models
                    mountPath: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                    readOnly: true
                  - name: hccn-config
                    mountPath: /etc/hccn.conf
                    readOnly: true
                  - name: shared-memory-volume
                    mountPath: /dev/shm
            volumes:
              - name: models
                hostPath:
                  path: /mnt/cache/deepseek-ai/DeepSeek-V2-Lite/
                  type: DirectoryOrCreate
              - name: hccn-config
                hostPath:
                  path: /etc/hccn.conf
                  type: File
              - name: shared-memory-volume
                emptyDir:
                  sizeLimit: 256Mi
                  medium: Memory
EOF

You should see Pods like the following:

  • deepseek-v2-lite-0-prefill-0-0

  • deepseek-v2-lite-0-decode-0-0
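The Pods can take a while to become Ready, since each one downloads the model in an init container before vLLM starts. A minimal sketch for blocking until they are Ready is shown here; the function name is hypothetical and the 20-minute default timeout is an arbitrary assumption, not a value recommended by Kthena.

```shell
# Hypothetical helper: wait until every Pod of a ModelServing reports Ready.
# Arguments: ModelServing name, namespace, optional timeout (default 20m, an assumption).
wait_for_modelserving() {
  local name="$1" namespace="$2" timeout="${3:-20m}"
  kubectl wait pod \
    --namespace "$namespace" \
    --selector "modelserving.volcano.sh/name=${name}" \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

For this guide the call would be `wait_for_modelserving deepseek-v2-lite dev`; it returns a non-zero status if the timeout expires first.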

To make the model reachable, the routing layer still needs to be configured through ModelServer and ModelRoute.

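2.3 ModelServer: PD group management

The ModelServer resource:

  • Selects the ModelServing workloads by label.

  • Groups prefill and decode Pods into PD pairs.

  • Configures the KV connector details and timeouts.

  • Exposes an internal gRPC/HTTP interface.

Create the ModelServer with the following command: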

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/kthena/refs/heads/main/examples/kthena-router/ModelServer-prefill-decode-disaggregation.yaml

cat << EOF | kubectl apply -f -
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelServer
metadata:
  name: deepseek-v2
  namespace: dev
spec:
  kvConnector:
    type: nixl
  workloadSelector:
    matchLabels:
      modelserving.volcano.sh/name: deepseek-v2-lite
    pdGroup:
      groupKey: "modelserving.volcano.sh/group-name"
      prefillLabels:
        modelserving.volcano.sh/role: prefill
      decodeLabels:
        modelserving.volcano.sh/role: decode
  workloadPort:
    port: 8000
  model: "deepseek-ai/DeepSeekV2"
  inferenceEngine: "vLLM"
  trafficPolicy:
    timeout: 10s
EOF
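To see which Pods the ModelServer's workloadSelector and pdGroup settings would match, the relevant labels can be printed as extra columns. This is a sketch under the assumption that the Pods carry the labels named in the manifest above (`modelserving.volcano.sh/role` and `modelserving.volcano.sh/group-name`); the function name is hypothetical.

```shell
# Hypothetical helper: list the Pods of a ModelServing together with the
# role and group-name labels that the ModelServer's pdGroup section refers to.
show_pd_groups() {
  local name="$1" namespace="$2"
  kubectl get pods \
    --namespace "$namespace" \
    --selector "modelserving.volcano.sh/name=${name}" \
    --label-columns "modelserving.volcano.sh/role,modelserving.volcano.sh/group-name"
}
```

Calling `show_pd_groups deepseek-v2-lite dev` should list one prefill and one decode Pod; Pods that belong to the same PD pair are expected to share a group-name value.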

2.4 ModelRoute: user-facing endpoint

The ModelRoute resource maps a model name (for example, "deepseek-ai/DeepSeekV2") to a ModelServer.

Example manifest:

cat << EOF | kubectl apply -f -
apiVersion: networking.serving.volcano.sh/v1alpha1
kind: ModelRoute
metadata:
  name: deepseek-v2
  namespace: dev
spec:
  modelName: "deepseek-ai/DeepSeekV2"
  rules:
    - name: "default"
      targetModels:
        - modelServerName: "deepseek-v2"
EOF
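After applying the route, the mapping from model name to ModelServer can be read back from the cluster. The sketch below uses kubectl's JSONPath output with the field names from the manifest above (`spec.modelName` and `spec.rules[].targetModels[].modelServerName`); the function name is hypothetical.

```shell
# Hypothetical helper: print "<model name> -> <target ModelServer name(s)>"
# for one ModelRoute, using the fields shown in the example manifest.
show_route_targets() {
  local route="$1" namespace="$2"
  kubectl get modelroute "$route" -n "$namespace" \
    -o jsonpath='{.spec.modelName}{" -> "}{.spec.rules[*].targetModels[*].modelServerName}{"\n"}'
}
```

For this guide the call would be `show_route_targets deepseek-v2 dev`.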

3. Verification

3.1 Check the workloads

Confirm that the prefill and decode Pods are up:

kubectl get modelserving deepseek-v2-lite -n dev -o yaml | grep status -A 10

kubectl get pod -n dev -owide \
  -l modelserving.volcano.sh/name=deepseek-v2-lite

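Both roles should be in the Running state and report Ready.

3.2 Test the chat endpoint

Once routing is configured, you can send a test request to the kthena-router Service. The first command reads the Service's LoadBalancer IP, so it assumes the Service has been assigned an external address: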

export ENDPOINT=$(kubectl get svc kthena-router -n kthena-system --output=jsonpath='{.status.loadBalancer.ingress[0].ip}:{.spec.ports[0].port}')

curl --location "http://${ENDPOINT}/v1/chat/completions" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeekV2",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital of China?"
      }
    ],
    "stream": false
  }'

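A successful JSON response confirms that:

  • Both the prefill and decode services are running on Ascend NPUs.

  • KV transfer between them works.

  • The Kthena routing layer is correctly fronting the vLLM-Ascend plugin.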
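For scripted checks, the same request can be reduced to a pass/fail result based on the HTTP status code. The following is a minimal sketch: the function name, the "ping" prompt, and the 60-second limit are assumptions chosen for illustration.

```shell
# Hypothetical helper: send one chat request and succeed only on HTTP 200.
# Arguments: router endpoint as host:port, model name.
check_chat_endpoint() {
  local endpoint="$1" model="$2" status
  # --write-out '%{http_code}' prints only the status code; the body is discarded.
  status="$(curl --silent --output /dev/null --write-out '%{http_code}' \
    --max-time 60 \
    --header 'Content-Type: application/json' \
    --data "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"stream\": false}" \
    "http://${endpoint}/v1/chat/completions")"
  if [ "$status" = "200" ]; then
    echo "chat endpoint OK"
  else
    echo "chat endpoint returned HTTP ${status}" >&2
    return 1
  fi
}
```

With the variable from the previous step, the call would be `check_chat_endpoint "$ENDPOINT" "deepseek-ai/DeepSeekV2"`. A status of 000 means curl could not connect at all, which usually points at the endpoint address and not at the model.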


4. Cleanup

To remove the deployment:

# 1. Remove user-facing routing
kubectl delete modelroute deepseek-v2 -n dev

# 2. Remove internal server
kubectl delete modelserver deepseek-v2 -n dev

# 3. Remove workloads
kubectl delete modelserving deepseek-v2-lite -n dev
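To confirm that nothing is left behind, the three resource kinds and the workload Pods can be listed once more. This is a sketch with a hypothetical function name; it reuses the resource names from the delete commands above and the label used in section 3.1.

```shell
# Hypothetical helper: list any remaining Kthena resources and workload Pods.
# Arguments: ModelServing name, namespace.
list_leftovers() {
  local name="$1" namespace="$2"
  kubectl get modelroute,modelserver,modelserving --namespace "$namespace"
  kubectl get pods --namespace "$namespace" \
    --selector "modelserving.volcano.sh/name=${name}"
}
```

After a complete cleanup, `list_leftovers deepseek-v2-lite dev` should report that no resources were found; Pods may remain listed for a short time while they terminate.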

5. Summary

For more advanced features, see the Kthena website.