Gateway Inference Extension#

This tutorial walks through setting up and using the Gateway Inference Extension in a production environment. The extension routes inference requests through the gateway and supports both individual inference models and inference pools.

Prerequisites#

Before starting this tutorial, ensure you have:

A Kubernetes cluster with GPU nodes available
kubectl configured to access your cluster
helm installed
A Hugging Face account with an API token
Basic understanding of Kubernetes concepts

Overview#

The Gateway Inference Extension provides:

Individual Model Inference: Direct access to specific models
Inference Pools: Load-balanced access to multiple model instances
Gateway API Integration: Standard Kubernetes Gateway API for routing
vLLM Integration: High-performance inference engine support

Step 1: Environment Setup#

1.1 Create Hugging Face Token Secret#

First, create a Kubernetes secret with your Hugging Face token:

# Replace <YOUR_HF_TOKEN> with your actual Hugging Face token
kubectl create secret generic hf-token --from-literal=token='<YOUR_HF_TOKEN>'
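Typing the token on the command line leaves it in shell history, and `kubectl create secret` fails if the secret already exists. The following is a minimal sketch of one way around both, assuming the token is exported in an environment variable named `HF_TOKEN`. That variable name and the `create_hf_secret` helper are illustrative choices, not part of this project.

```shell
# Hypothetical helper: create or update the hf-token secret from $HF_TOKEN.
# Assumes HF_TOKEN is exported in the current shell.
create_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  # --dry-run=client renders the Secret locally; piping it to `kubectl apply`
  # makes the step safe to re-run when the secret already exists.
  kubectl create secret generic hf-token \
    --from-literal=token="$HF_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Usage: export HF_TOKEN=...; create_hf_secret
```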

1.2 Install Gateway API CRDs#

Install the required Custom Resource Definitions (CRDs):

# Install KGateway CRDs
KGTW_VERSION=v2.0.2
helm upgrade -i --create-namespace --namespace kgateway-system \
  --version "$KGTW_VERSION" \
  kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml

# Install Gateway API inference extension CRDs
VERSION=v0.3.0
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$VERSION/manifests.yaml
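Resources applied in later steps are rejected until their CRDs are registered, so it can help to wait for them explicitly. The sketch below uses `kubectl wait` for that. The CRD names are assumptions derived from the API groups and kinds used later in this tutorial, and `wait_for_crds` is a hypothetical helper name; confirm the names with `kubectl get crd`.

```shell
# Hypothetical helper: block until the CRDs this tutorial relies on are established.
# CRD names are assumed from the API groups and kinds used in later steps.
# Optional argument: timeout (default 60s).
wait_for_crds() {
  kubectl wait --for=condition=Established --timeout="${1:-60s}" \
    crd/gateways.gateway.networking.k8s.io \
    crd/httproutes.gateway.networking.k8s.io \
    crd/inferencepools.inference.networking.x-k8s.io \
    crd/inferencemodels.inference.networking.x-k8s.io
}

# Usage: wait_for_crds 120s
```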

1.3 Install KGateway with Inference Extension#

# Install KGateway with inference extension enabled
# (uses KGTW_VERSION set in step 1.2)
helm upgrade -i --namespace kgateway-system \
  --version "$KGTW_VERSION" \
  kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --set inferenceExtension.enabled=true
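Before moving on, it may be worth confirming that the release installed and its workloads are up. A minimal sketch, assuming the release name and namespace from the command above; `verify_kgateway` is a hypothetical helper, and it waits on every Deployment in the namespace rather than assuming a particular Deployment name.

```shell
# Hypothetical helper: confirm the Helm release exists, then wait for all
# Deployments in the namespace to report Available.
# Arguments: namespace (default kgateway-system), timeout (default 120s).
verify_kgateway() {
  ns="${1:-kgateway-system}"
  helm status kgateway --namespace "$ns" >/dev/null || return 1
  kubectl wait --for=condition=Available deployment --all \
    --namespace "$ns" --timeout="${2:-120s}"
}

# Usage: verify_kgateway
```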

Step 2: Deploy vLLM Models#

2.1 Understanding vLLM Runtime#

The vLLM Runtime is a custom resource that manages model deployments. See configs/vllm/gpu-deployment.yaml for an example configuration.

2.2 Apply vLLM Deployment#

# Apply the vLLM deployment configuration
kubectl apply -f configs/vllm/gpu-deployment.yaml

Production Considerations:

Adjust resource requests/limits based on your model size and GPU capacity
Consider using multiple replicas for high availability
Monitor GPU utilization and adjust accordingly
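The considerations above can be checked from the command line. The sketch below lists allocatable GPUs per node and waits for the vLLM pods to become Ready. It assumes GPUs are advertised under the `nvidia.com/gpu` resource name (as the NVIDIA device plugin does) and that the pods carry the `app=vllm-runtime` label used for log collection in Step 6; both helper names are hypothetical.

```shell
# List allocatable GPUs per node. Assumes GPUs are exposed as nvidia.com/gpu.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Wait for the vLLM pods to become Ready. Model download and load can be slow,
# so the default timeout is generous. Optional argument: timeout.
wait_for_vllm() {
  kubectl wait --for=condition=Ready pod -l app=vllm-runtime --timeout="${1:-600s}"
}

# Usage: list_gpu_nodes; wait_for_vllm 900s
```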

Step 3: Configure Inference Resources#

3.1 Individual Model Configuration#

Create an InferenceModel resource for direct model access:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: legogpt
spec:
  modelName: legogpt
  criticality: Standard
  poolRef:
    name: vllm-llama3-1b-instruct
  targetModels:
  - name: legogpt
    weight: 100

3.2 Inference Pool Configuration#

To route to multiple model instances, define an InferencePool. See configs/inferencepool-resources.yaml for an example.

3.3 Apply Inference Resources#

# Apply individual model configuration
kubectl apply -f configs/inferencemodel.yaml

# Apply inference pool configuration
kubectl apply -f configs/inferencepool-resources.yaml
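A typo in a manifest is easier to catch before anything is persisted. The following sketch validates each file with a server-side dry run and applies the files only if all of them pass. `apply_checked` is a hypothetical helper name; it assumes the CRDs from Step 1 are already installed, since server-side validation needs them.

```shell
# Hypothetical helper: validate every manifest against the API server first,
# then apply them only if all validations succeed.
apply_checked() {
  for manifest in "$@"; do
    # --dry-run=server runs schema and admission checks without persisting.
    kubectl apply --dry-run=server -f "$manifest" || return 1
  done
  for manifest in "$@"; do
    kubectl apply -f "$manifest" || return 1
  done
}

# Usage: apply_checked configs/inferencemodel.yaml configs/inferencepool-resources.yaml
```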

Step 4: Configure Gateway Routing#

4.1 Gateway Configuration#

The gateway acts as the entry point for inference requests:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
  - name: http
    port: 80
    protocol: HTTP

4.2 HTTPRoute Configuration#

HTTPRoute defines how requests are routed to inference resources:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-1b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /

4.3 Apply Gateway Resources#

# Apply gateway configuration
kubectl apply -f configs/gateway/kgateway/gateway.yaml

# Apply HTTP route configuration
kubectl apply -f configs/httproute.yaml
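Applying the manifests does not guarantee that the gateway is serving traffic or that the route was attached. A minimal sketch for checking both, using the resource names from the manifests above and the standard Gateway API `Programmed` and `Accepted` status conditions; `verify_routing` is a hypothetical helper name.

```shell
# Hypothetical helper: wait for the Gateway to be programmed, then confirm the
# HTTPRoute was accepted by its parent. Optional argument: timeout (default 120s).
verify_routing() {
  kubectl wait --for=condition=Programmed gateway/inference-gateway \
    --timeout="${1:-120s}" || return 1
  accepted=$(kubectl get httproute llm-route \
    -o jsonpath='{.status.parents[0].conditions[?(@.type=="Accepted")].status}')
  if [ "$accepted" != "True" ]; then
    echo "HTTPRoute llm-route not accepted (status: ${accepted:-unknown})" >&2
    return 1
  fi
}

# Usage: verify_routing
```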

Step 5: Testing the Setup#

5.1 Get Gateway IP Address#

# Get the external IP of the gateway
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

echo "Gateway IP: $IP"
echo "Gateway Port: $PORT"
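On many clusters the gateway address is assigned asynchronously, so the lookup above can return an empty string right after the gateway is created. The sketch below polls until an address appears. `gateway_address` is a hypothetical helper, and the retry count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll the Gateway status until an address is assigned.
# Arguments: number of attempts (default 30), seconds between attempts (default 5).
gateway_address() {
  attempts="${1:-30}"
  while [ "$attempts" -gt 0 ]; do
    addr=$(kubectl get gateway/inference-gateway \
      -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
    if [ -n "$addr" ]; then
      echo "$addr"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${2:-5}"
  done
  echo "gateway has no address yet" >&2
  return 1
}

# Usage: IP=$(gateway_address) && echo "Gateway IP: $IP"
```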

5.2 Send Test Inference Request#

# Test with a simple completion request
curl -i http://${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0.5
  }'
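For scripted checks, the HTTP status code is usually enough to tell whether the request reached a model server. A minimal sketch that sends a small completion request and fails unless the gateway answers with 200; `smoke_test_completion` is a hypothetical helper, and the short prompt and token limit are arbitrary.

```shell
# Hypothetical helper: send a tiny completion request and fail unless HTTP 200.
# Arguments: base URL (for example "http://${IP}:${PORT}"), model name,
# and an optional request timeout in seconds (default 60).
smoke_test_completion() {
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    --max-time "${3:-60}" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$2\", \"prompt\": \"Hello\", \"max_tokens\": 5}" \
    "$1/v1/completions")
  if [ "$status" != "200" ]; then
    echo "completion request failed with HTTP status ${status:-none}" >&2
    return 1
  fi
}

# Usage: smoke_test_completion "http://${IP}:${PORT}" legogpt
```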

5.3 Test Chat Completion#

# Test chat completion endpoint
curl -i http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
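When the message text comes from a variable, quotes inside it can break a hand-written JSON body. The sketch below builds the chat payload with basic escaping. Both helpers are hypothetical, and the escaping covers only backslashes and double quotes, so it assumes single-line input without control characters.

```shell
# Hypothetical helper: escape backslashes and double quotes for use inside a
# JSON string. Does not handle newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a chat completion request body.
# Arguments: model name, user message, max_tokens (default 50).
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}\n' \
    "$1" "$(json_escape "$2")" "${3:-50}"
}

# Usage:
#   curl -i "http://${IP}:${PORT}/v1/chat/completions" \
#     -H 'Content-Type: application/json' \
#     -d "$(chat_payload legogpt 'Hello, how are you?')"
```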

Step 6: Monitoring and Troubleshooting#

6.1 Check Resource Status#

# Check vLLM runtime status
kubectl get vllmruntime

# Check inference model status
kubectl get inferencemodel

# Check inference pool status
kubectl get inferencepool

# Check gateway status
kubectl get gateway
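The individual checks above can be rolled into one pass that flags any resource kind with no instances. A minimal sketch, assuming everything lives in the current namespace; `check_resources` is a hypothetical helper, and `httproute` is added to the list because Step 4 creates one.

```shell
# Hypothetical helper: report resource kinds from this tutorial that have no
# instances in the current namespace. Returns non-zero if any are missing.
check_resources() {
  missing=0
  for kind in vllmruntime inferencemodel inferencepool gateway httproute; do
    # With -o name, kubectl prints one line per object and nothing on stdout
    # when there are none.
    if [ -z "$(kubectl get "$kind" -o name 2>/dev/null)" ]; then
      echo "no $kind resources found" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage: check_resources && echo "all resource kinds present"
```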

6.2 View Logs#

# Get vLLM runtime logs
kubectl logs -l app=vllm-runtime

# Get gateway logs
kubectl logs -n kgateway-system -l app=kgateway
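Full logs from a busy model server can be long. The sketch below narrows them to recent lines and adds the latest cluster events, which often explain scheduling or image-pull problems. The helper names and default limits are hypothetical; the label selectors are the ones used in the commands above.

```shell
# Hypothetical helper: show recent log lines from the vLLM and gateway pods.
# Arguments: lines per pod (default 100), time window (default 10m).
recent_logs() {
  # --prefix tags each line with its source pod and container.
  kubectl logs -l app=vllm-runtime --tail="${1:-100}" --since="${2:-10m}" --prefix
  kubectl logs -n kgateway-system -l app=kgateway --tail="${1:-100}" --since="${2:-10m}" --prefix
}

# Hypothetical helper: show the most recent events in the current namespace.
# Argument: number of events (default 20).
recent_events() {
  kubectl get events --sort-by=.lastTimestamp | tail -n "${1:-20}"
}

# Usage: recent_logs 50 5m; recent_events
```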

Step 7: Uninstall#

To uninstall all the resources installed on the cluster, run the following:

# Delete the gateway and HTTP route
kubectl delete -f configs/httproute.yaml --ignore-not-found=true
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml --ignore-not-found=true

# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml --ignore-not-found=true
kubectl delete -f configs/inferencepool-resources.yaml --ignore-not-found=true

# Delete the vLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml --ignore-not-found=true

# Delete the Gateway API inference extension CRDs
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml --ignore-not-found=true

# Delete helm releases
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system

# Delete the namespace last to ensure all resources are removed
kubectl delete ns kgateway-system --ignore-not-found=true
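After the cleanup, a quick scan for leftover CRDs can confirm that nothing from the extension remains. A minimal sketch: the grep pattern is an assumption about how the inference extension and kgateway CRDs are named, both helper names are hypothetical, and the Gateway API CRDs are deliberately not matched because the steps above do not remove them.

```shell
# Hypothetical helper: list CRDs whose names suggest they belong to the
# inference extension or kgateway. The pattern is an assumption.
leftover_crds() {
  kubectl get crd -o name 2>/dev/null | grep -E 'inference\.networking|kgateway'
}

# Hypothetical helper: report leftovers. grep exits non-zero when nothing
# matches, so the clean case falls through to the final message.
report_leftovers() {
  if remaining=$(leftover_crds); then
    echo "CRDs still installed:"
    echo "$remaining"
    return 1
  fi
  echo "no inference extension or kgateway CRDs remain"
}

# Usage: report_leftovers
```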