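Sleep Mode Guide
Overview
Sleep Mode is an API for offloading model weights and clearing the KV cache from NPU memory. It is especially important for reinforcement learning (RL) post-training workloads, particularly online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically uses an inference engine such as vLLM for autoregressive generation, followed by forward and backward passes for optimization.
Because the generation and training phases may use different model parallelism strategies, it is crucial to release the KV cache promptly during training, and even to offload the model parameters held inside vLLM. This keeps memory use efficient and avoids resource contention on the NPU.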
Getting Started
With enable_sleep_mode=True, vLLM routes its memory management (malloc, free) through a dedicated memory pool. During model loading and KV cache initialization, each allocation is tagged, which conceptually yields a map: {"weights": data, "kv_cache": data}.
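The tagging idea can be illustrated with a toy model. This is a minimal sketch, not vLLM's actual allocator: the `TaggedPool` class and its methods are hypothetical, and the tag names follow the `wake_up` tags used later in this guide ("weights", "kv_cache"). It only shows how tagged allocations allow weights to be backed up and restored separately from the KV cache.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedPool:
    """Toy model of a memory pool that labels each allocation with a tag.

    Hypothetical illustration only; this is not vLLM's real memory pool.
    """

    device: dict = field(default_factory=dict)       # tag -> bytes resident on the device
    host_backup: dict = field(default_factory=dict)  # tag -> bytes offloaded to CPU memory

    def allocate(self, tag: str, data: bytes) -> None:
        """Record an allocation under the given tag."""
        self.device[tag] = data

    def sleep(self, level: int = 1) -> None:
        """Free all device memory; at level 1, back up the weights to the host first."""
        if level == 1 and "weights" in self.device:
            self.host_backup["weights"] = self.device["weights"]
        self.device.clear()

    def wake_up(self, tags=("weights", "kv_cache")) -> None:
        """Re-create the requested tags, restoring contents only where a backup exists."""
        for tag in tags:
            # Tags without a backup come back as empty placeholders.
            self.device[tag] = self.host_backup.pop(tag, b"")


pool = TaggedPool()
pool.allocate("weights", b"model-parameters")
pool.allocate("kv_cache", b"cached-attention-state")

pool.sleep(level=1)
print(pool.device)  # {}
pool.wake_up()
print(pool.device)  # {'weights': b'model-parameters', 'kv_cache': b''}
```

In this sketch the weights survive a level 1 sleep because they were copied to the host, while the KV cache comes back empty and has to be rebuilt by later requests.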
The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
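Level 1 Sleep
Action: Offloads the model weights and discards the KV cache.
Memory: The model weights are moved to CPU memory; the KV cache is discarded.
Use case: Suitable when the same model will be reused later.
Note: Make sure there is enough CPU memory to hold the model weights.
Level 2 Sleep
Action: Discards both the model weights and the KV cache.
Memory: The contents of both the model weights and the KV cache are discarded.
Use case: Ideal when switching to a different model or updating the current model.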
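The choice between the two levels, together with the CPU memory caveat for level 1, can be sketched as a small helper. The function names below are hypothetical and not part of vLLM, and the estimate assumes 2 bytes per parameter (fp16/bf16) while ignoring any other host-side overhead.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Approximate size of the model weights (assumes 2 bytes per parameter)."""
    return num_params * bytes_per_param


def pick_sleep_level(reuse_same_model: bool, num_params: int, free_host_bytes: int) -> int:
    """Return 1 when the weights should be kept in CPU memory, otherwise 2."""
    if not reuse_same_model:
        # Switching or updating the model: the old weights are not needed again.
        return 2
    if weight_bytes(num_params) > free_host_bytes:
        raise MemoryError("not enough CPU memory to hold the offloaded weights")
    return 1


# A 0.5B-parameter model at 2 bytes per parameter needs a bit under 1 GiB of host memory.
print(f"{weight_bytes(500_000_000) / GIB:.2f} GiB")                   # 0.93 GiB
print(pick_sleep_level(True, 500_000_000, free_host_bytes=4 * GIB))   # 1
print(pick_sleep_level(False, 500_000_000, free_host_bytes=4 * GIB))  # 2
```

The returned value would then be passed as the `level` argument of `llm.sleep(...)`.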
Because this feature relies on the low-level AscendCL API, you must follow the installation guide and build from source to use sleep mode. If you are using v0.7.3, remember to run export COMPILE_CUSTOM_KERNELS=1 before building. For the latest version (v0.11.0), COMPILE_CUSTOM_KERNELS is set to 1 by default when building from source.
Usage
The following is a simple example of how to use sleep mode.
Offline inference:
    import os

    import torch
    from vllm import LLM, SamplingParams
    from vllm.utils import GiB_bytes

    os.environ["VLLM_USE_MODELSCOPE"] = "True"
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    if __name__ == "__main__":
        prompt = "How are you?"

        free, total = torch.npu.mem_get_info()
        print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
        # Record the baseline NPU memory usage in case another process is running.
        used_bytes_baseline = total - free

        llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
        sampling_params = SamplingParams(temperature=0, max_tokens=10)
        output = llm.generate(prompt, sampling_params)

        # Level 1: offload the weights to CPU memory and discard the KV cache.
        llm.sleep(level=1)

        free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
        print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
        used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
        # The remaining usage should now be less than the model weights
        # (0.5B model, about 1 GiB of weights).
        assert used_bytes < 1 * GiB_bytes

        llm.wake_up()
        output2 = llm.generate(prompt, sampling_params)
        # With temperature=0, the output after wake-up should match the original.
        assert output[0].outputs[0].text == output2[0].outputs[0].text
Online serving:
Note
Because these endpoints could be abused by malicious clients, make sure you are running in dev mode and explicitly set the environment variable VLLM_SERVER_DEV_MODE to expose the sleep and wake-up endpoints.

    export VLLM_SERVER_DEV_MODE="1"
    export VLLM_WORKER_MULTIPROC_METHOD="spawn"
    export VLLM_USE_MODELSCOPE="True"

    vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode

    # After the server is up, send requests to these endpoints.

    # Sleep level 1
    curl -X POST http://127.0.0.1:8000/sleep \
        -H "Content-Type: application/json" \
        -d '{"level": "1"}'

    curl -X GET http://127.0.0.1:8000/is_sleeping

    # Sleep level 2
    curl -X POST http://127.0.0.1:8000/sleep \
        -H "Content-Type: application/json" \
        -d '{"level": "2"}'

    # Wake up
    curl -X POST http://127.0.0.1:8000/wake_up

    # Wake up with a tag; tags must be in ["weights", "kv_cache"]
    curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"

    curl -X GET http://127.0.0.1:8000/is_sleeping

    # After sleeping and waking up, the server can still handle requests
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen2.5-0.5B-Instruct",
            "prompt": "The future of AI is",
            "max_tokens": 7,
            "temperature": 0
        }'
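For scripted workflows such as an RL training loop, the same endpoints can be called from Python. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the requests mirror the curl commands in this guide, and passing several tags as a repeated `tags` query parameter is an assumption rather than documented behavior.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # same address as in the curl commands


def sleep_request(level: int) -> urllib.request.Request:
    """Build POST /sleep with the level in a JSON body, mirroring the curl call."""
    body = json.dumps({"level": str(level)}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/sleep",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wake_up_request(tags=None) -> urllib.request.Request:
    """Build POST /wake_up, optionally restricted to tags such as "weights"."""
    url = f"{BASE_URL}/wake_up"
    if tags:
        # Assumption: multiple tags are sent as a repeated "tags" query parameter.
        url += "?" + urllib.parse.urlencode([("tags", tag) for tag in tags])
    return urllib.request.Request(url, method="POST")


def is_sleeping_request() -> urllib.request.Request:
    """Build GET /is_sleeping."""
    return urllib.request.Request(f"{BASE_URL}/is_sleeping", method="GET")


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode()
```

With the server running in dev mode, a training script could call `send(sleep_request(1))` before its optimization step and `send(wake_up_request())` before the next generation phase, using `send(is_sleeping_request())` to check the current state.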