Sleep Mode & ACK Protocol

vLLM-Omni’s Sleep Mode lets you temporarily release most of the GPU memory used by a model, such as model weights and key-value (KV) caches, without stopping the server or tearing down the Docker container.

This feature is inherited from vLLM’s Sleep Mode and extended with the Omni ACK Protocol to support multi-stage pipelines and heterogeneous hardware backends (NVIDIA, AMD, Intel, Huawei). It is especially useful in RLHF, dynamic model switching, or cost-saving scenarios.


1. Feature Documentation

Overview

Omni Sleep Mode provides a mechanism to "sleep" specific model stages. When a stage enters sleep, its physical VRAM is reclaimed by the system, while the process state is preserved for rapid "wake-up" without full re-initialization.

Sleep Levels

We support two levels of hibernation to balance recovery speed and memory efficiency:

| Level | Name | Mechanism | Recovery Speed | Memory Freed |
|---|---|---|---|---|
| Level 1 | Weight Offloading | Offloads weights to Host CPU RAM. | Fast (DMA) | Substantial |
| Level 2 | Full De-mapping | Physically releases memory pages via VRAM scavenging. | Moderate | Maximum (up to 95%+) |

Supported Platforms

Omni Sleep Mode is optimized for high-performance computing backends:

  • NVIDIA: Supported via Virtual Memory Management (VMM).
  • AMD (ROCm): Fully supported with physical page de-mapping.
  • Intel XPU: Supported via Level Zero memory management.
  • Huawei NPU: Supported via Ascend memory scavenging.

Hardware Requirements

  • Memory Considerations: System RAM must be sufficient to hold offloaded weights during sleep.
  • TP Support: Tensor Parallel groups synchronize sleep/wake transitions across all workers.
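The memory consideration above can be checked before sleeping a stage. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the helper names and the 10% headroom default are illustrative choices, not part of vLLM-Omni.

```python
def parse_mem_available_bytes(meminfo_text: str) -> int:
    """Return the MemAvailable value from /proc/meminfo-style text, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   32768000 kB" (value is in KiB).
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable entry not found")


def host_ram_can_hold(weight_bytes: int, available_bytes: int,
                      headroom: float = 0.10) -> bool:
    """True if the weights fit in host RAM while keeping `headroom` free.

    `headroom` is a hypothetical safety margin, not a vLLM-Omni setting.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return weight_bytes <= available_bytes * (1.0 - headroom)


def read_mem_available_bytes(path: str = "/proc/meminfo") -> int:
    """Read the current MemAvailable value from the host (Linux only)."""
    with open(path, "r", encoding="utf-8") as handle:
        return parse_mem_available_bytes(handle.read())
```

For example, `host_ram_can_hold(weight_bytes, read_mem_available_bytes())` could gate a Level 1 request, where `weight_bytes` is your own estimate of the stage's weight size.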

2. Usage Examples

Python API Example

You can programmatically control the lifecycle of stages using the AsyncOmni engine.

import asyncio
from vllm_omni.entrypoints.async_omni import AsyncOmni

async def run_sleep_demo():
    # 1. Initialize the engine with sleep mode enabled.
    engine = AsyncOmni(
        model="ByteDance-Seed/BAGEL-7B-MoT",
        enable_sleep_mode=True,
    )

    # 2. Put Stage 0 into Level 2 sleep; each ACK reports the VRAM it reclaimed.
    acks = await engine.sleep(stage_ids=[0], level=2)
    print(f"Freed {acks[0].freed_bytes / 1024**3:.2f} GiB on Stage 0")

    # 3. Wake Stage 0 so it can serve requests again.
    await engine.wake_up(stage_ids=[0])

if __name__ == "__main__":
    asyncio.run(run_sleep_demo())
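When other work runs on the GPU while a stage sleeps, it helps to guarantee that the stage is woken even if that work fails. The sketch below wraps the `sleep` and `wake_up` calls shown above in an async context manager. `stages_asleep` is a hypothetical helper, and `RecordingEngine` is only a stand-in so the pattern can be tried without a GPU; neither is part of vLLM-Omni.

```python
import asyncio
import contextlib


@contextlib.asynccontextmanager
async def stages_asleep(engine, stage_ids, level=1):
    """Sleep the given stages, then always wake them on exit."""
    acks = await engine.sleep(stage_ids=stage_ids, level=level)
    try:
        yield acks
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        await engine.wake_up(stage_ids=stage_ids)


class RecordingEngine:
    """Stand-in exposing the same two method names; records each call."""

    def __init__(self):
        self.calls = []

    async def sleep(self, stage_ids, level):
        self.calls.append(("sleep", tuple(stage_ids), level))
        return []

    async def wake_up(self, stage_ids):
        self.calls.append(("wake_up", tuple(stage_ids)))
        return []


async def demo():
    engine = RecordingEngine()
    try:
        async with stages_asleep(engine, [0], level=2):
            raise RuntimeError("work on the freed GPU failed")
    except RuntimeError:
        pass
    return engine.calls
```

With a real engine, the body of the `async with` block is where a co-located job (for example a training step) would run.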

Server Command Example

Start the server with sleep mode enabled:

Method 1: vllm serve CLI

vllm serve ByteDance-Seed/BAGEL-7B-MoT \
    --omni \
    --enable-sleep-mode \
    --trust-remote-code \
    --gpu-memory-utilization 0.7

Method 2: Python module entry point

python3 -m vllm_omni.entrypoints.openai.api_server \
    --model ByteDance-Seed/BAGEL-7B-MoT \
    --omni \
    --enable-sleep-mode \
    --trust-remote-code \
    --gpu-memory-utilization 0.7

Test Scenarios & Commands

Scenario 1: LLM Engine Sleep

Objective: Verify VRAM reclamation for Stage 0 (Thinker).

Trigger sleep (Level 1 or Level 2) via client:

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0], "level": 2}'

Tip: Open a new terminal and run rocm-smi or nvidia-smi to observe the immediate drop in VRAM usage.
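The same observation can be scripted. This is a minimal sketch for NVIDIA devices only, using the `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` query, which prints one MiB value per GPU; the function names are illustrative, and an equivalent for rocm-smi is not shown.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]


def parse_used_mib(output: str) -> list[int]:
    """Parse one integer MiB value per non-empty line (one line per GPU)."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_used_mib() -> list[int]:
    """Run nvidia-smi and return used VRAM per GPU in MiB."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_used_mib(result.stdout)


def vram_drop_mib(before: list[int], after: list[int]) -> list[int]:
    """Per-GPU reduction in used VRAM between two readings."""
    return [b - a for b, a in zip(before, after)]
```

Call `read_used_mib()` once before the sleep request and once after it, then pass both readings to `vram_drop_mib`.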

Scenario 2: Diffusion Sleep

Objective: Verify VRAM reclamation for Stage 1 (Diffusion).

Trigger sleep (Level 1 or Level 2) via client:

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [1], "level": 2}'
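The curl calls in Scenarios 1 and 2 can also be issued from Python with only the standard library. The sketch below mirrors the URL, header, and body used above; it assumes the endpoint answers with a JSON body, whose schema is not documented here, so the result is returned unparsed beyond JSON decoding.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_sleep_request(stage_ids, level, base_url=BASE_URL):
    """Build the POST request for /v1/omni/sleep without sending it."""
    body = json.dumps({"stage_ids": list(stage_ids), "level": level})
    return urllib.request.Request(
        f"{base_url}/v1/omni/sleep",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(request, timeout=60.0):
    """Send the request and decode the response, assumed to be JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For Scenario 2 this becomes `post_json(build_sleep_request([1], 2))`.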

Scenario 3: Multi-Stage Coordinated Stress Test

Objective: Test concurrent sleep and rapid wake-up across multiple stages.

Concurrent Sleep (Stage 0 & 1):

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0, 1], "level": 2}'

Rapid Wake-up:

curl -X POST http://localhost:8000/v1/omni/wakeup \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0, 1]}'
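To turn the sleep and wake-up pair above into a repeatable stress loop, the two calls can be timed over several cycles. This is a sketch of a generic harness: `sleep_fn` and `wake_fn` are placeholders for whatever issues the two requests (curl via subprocess, an HTTP client, or the Python API), and nothing here is specific to vLLM-Omni.

```python
import time


def run_cycles(sleep_fn, wake_fn, cycles=3, clock=time.perf_counter):
    """Run sleep/wake pairs and return (sleep_seconds, wake_seconds) per cycle."""
    timings = []
    for _ in range(cycles):
        start = clock()
        sleep_fn()          # e.g. POST /v1/omni/sleep for stages 0 and 1
        slept = clock()
        wake_fn()           # e.g. POST /v1/omni/wakeup for stages 0 and 1
        woke = clock()
        timings.append((slept - start, woke - slept))
    return timings
```

The `clock` parameter exists so the loop can be exercised with a fake clock; leave it at the default for real measurements.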

Scenario 4: Full Lifecycle Memory Audit & Functional Integrity

Objective: Audit the complete flow from sleep to wake-up, then validate inference.

Check Initial State: Observe baseline VRAM usage.

Trigger Deep Sleep (Level 2):

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0], "level": 2}'

Wake-up Model:

curl -X POST http://localhost:8000/v1/omni/wakeup \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0]}'

Verify Functional Integrity (Inference): Ensure the model still generates valid output after reloading weights.

curl -X POST http://localhost:8000/v1/images/generations \
     -H "Content-Type: application/json" \
     -d '{
       "prompt": "A huge swimming pool, with many people swimming.",
       "model": "ByteDance-Seed/BAGEL-7B-MoT",
       "response_format": "b64_json",
       "extra_body": {"sampling_params": {"num_inference_steps": 4, "seed": 42}}
     }' > post.json
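A quick way to confirm that post.json holds a real image is to decode it and inspect the file signature. The sketch below assumes the response follows the OpenAI Images layout, `{"data": [{"b64_json": "..."}]}`; that layout is an assumption about this endpoint, and the helper names are illustrative.

```python
import base64
import json

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # standard PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"       # start-of-image marker for JPEG


def extract_images(response_text: str) -> list[bytes]:
    """Decode every b64_json entry under the (assumed) top-level "data" list."""
    payload = json.loads(response_text)
    return [base64.b64decode(item["b64_json"]) for item in payload["data"]]


def looks_like_image(blob: bytes) -> bool:
    """True if the bytes start with a PNG or JPEG signature."""
    return blob.startswith(PNG_MAGIC) or blob.startswith(JPEG_MAGIC)


def check_response_file(path: str = "post.json") -> bool:
    """Return True if the file contains at least one decodable image."""
    with open(path, "r", encoding="utf-8") as handle:
        images = extract_images(handle.read())
    return bool(images) and all(looks_like_image(img) for img in images)
```

A `KeyError` from `extract_images` usually means the server returned an error object instead of image data, which is itself a useful signal after wake-up.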

3. API Reference

Methods

| Method | Arguments | Return Type | Description |
|---|---|---|---|
| sleep | stage_ids: List[int], level: int | List[OmniACK] | Triggers hibernation for specified stages. |
| wake_up | stage_ids: List[int] | List[OmniACK] | Reloads weights and re-maps memory. |

OmniACK Dataclass Fields

| Field | Type | Description |
|---|---|---|
| task_id | str | Unique identifier for the operation. |
| status | str | SUCCESS or ERROR. |
| stage_id | int | The ID of the stage that responded. |
| rank | int | The rank ID within the Tensor Parallel group. |
| freed_bytes | int | Actual amount of physical VRAM reclaimed. |
| metadata | dict | Additional platform-specific metrics. |
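The fields above suggest a simple way to summarize a list of ACKs. The sketch below uses a local stand-in dataclass, `AckSketch`, that mirrors the documented fields rather than importing the real `OmniACK`; it also assumes each Tensor Parallel rank returns its own ACK, so reclaimed bytes are summed per stage.

```python
from dataclasses import dataclass, field


@dataclass
class AckSketch:
    """Local stand-in mirroring the documented OmniACK fields."""
    task_id: str
    status: str
    stage_id: int
    rank: int
    freed_bytes: int
    metadata: dict = field(default_factory=dict)


def summarize(acks):
    """Return (freed bytes per stage, list of failed (stage_id, rank) pairs)."""
    freed_per_stage = {}
    failed = []
    for ack in acks:
        if ack.status != "SUCCESS":
            failed.append((ack.stage_id, ack.rank))
            continue
        freed_per_stage[ack.stage_id] = (
            freed_per_stage.get(ack.stage_id, 0) + ack.freed_bytes
        )
    return freed_per_stage, failed
```

Because only attribute names are used, `summarize` should also accept the objects returned by `engine.sleep(...)`, provided they expose the same fields.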

Metadata Field Analysis

The metadata field is a dynamic dictionary containing hardware-specific telemetry and audit data, primarily used for verifying memory reclamation on various backends (e.g., AMD ROCm, NVIDIA CUDA).

"metadata": {
    "source": "Platform_AMD_Instinct_MI300X",
    "total_freed_gib": "78.57",
    "rank_residual_gib": "2.07"
}

Core Utility:

VRAM Reclamation Audit (total_freed_gib): Converts raw freed_bytes into human-readable GiB. It serves as the primary metric to verify that Level 2 sleep has successfully purged model weights from VRAM.

Residual & Fragmentation Monitoring (rank_residual_gib): Reports the remaining VRAM footprint after memory de-mapping. A low residual value (e.g., 2.07 GiB) confirms a successful "clean" state, ensuring the device is ready for high-memory co-located tasks like training or diffusion pipelines.

Backend Traceability (source): Identifies the underlying hardware driver or audit source. This is critical for debugging synchronization issues in multi-stage, distributed environments.

Performance Analytics (Roadmap): Future updates will include latency_ms (context-switch overhead) and cuda_graph_recalled (graph engine status) to optimize performance in high-frequency sleep/wake scenarios.
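The audit described for total_freed_gib and rank_residual_gib can be automated. This is a minimal sketch, assuming both values arrive as decimal strings as in the sample above and that total_freed_gib is the GiB conversion of the same ACK's freed_bytes; the 4 GiB residual threshold and 0.01 GiB tolerance are arbitrary illustrative defaults.

```python
GIB = 1024 ** 3


def audit_metadata(metadata, freed_bytes=None,
                   max_residual_gib=4.0, tolerance_gib=0.01):
    """Check the residual footprint and, optionally, consistency with freed_bytes."""
    total = float(metadata["total_freed_gib"])
    residual = float(metadata["rank_residual_gib"])
    report = {
        "total_freed_gib": total,
        "rank_residual_gib": residual,
        # "clean" means the leftover footprint is under the chosen threshold.
        "clean": residual <= max_residual_gib,
    }
    if freed_bytes is not None:
        # Compare the reported GiB figure with the raw byte count.
        report["consistent"] = abs(freed_bytes / GIB - total) <= tolerance_gib
    return report
```

With the sample metadata above, the report marks the state as clean because 2.07 GiB is below the default threshold.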