Sleep Mode & ACK Protocol¶

vLLM-Omni’s Sleep Mode lets you temporarily release most of the GPU memory used by a model, including model weights and key-value (KV) caches, without stopping the server or unloading the Docker container.

This feature is inherited from vLLM’s Sleep Mode and extended with the Omni ACK Protocol to support multi-stage pipelines and heterogeneous hardware backends (NVIDIA, AMD, Intel, Huawei). It is especially useful in RLHF, dynamic model switching, or cost-saving scenarios.

1. Feature Documentation¶

Overview¶

Omni Sleep Mode provides a mechanism to "sleep" specific model stages. When a stage enters sleep, its physical VRAM is reclaimed by the system, while the process state is preserved for rapid "wake-up" without full re-initialization.

Sleep Levels¶

We support two levels of hibernation to balance recovery speed and memory efficiency:

Level	Name	Mechanism	Recovery Speed	Memory Freed
Level 1	Weight Offloading	Offloads weights to Host CPU RAM.	Fast (DMA)	Substantial
Level 2	Full De-mapping	Physically releases memory pages via VRAM scavenging.	Moderate	Maximum (95% or more)

Supported Platforms¶

Omni Sleep Mode supports the following hardware backends:

NVIDIA: Supported via Virtual Memory Management (VMM).
AMD (ROCm): Fully supported with physical page de-mapping.
Intel XPU: Supported via Level Zero memory management.
Huawei NPU: Supported via Ascend memory scavenging.

Hardware Requirements¶

Memory Considerations: System RAM must be sufficient to hold offloaded weights during sleep.
TP Support: Tensor Parallel groups synchronize sleep/wake transitions across all workers.
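
The memory consideration above can be checked before choosing Level 1 sleep. The following is a minimal sketch, not part of vLLM-Omni: `fits_in_host_ram` and its `headroom` parameter are hypothetical, and the caller has to supply both the weight size and the free host RAM.

```python
def fits_in_host_ram(weight_bytes: int, available_host_bytes: int,
                     headroom: float = 0.10) -> bool:
    """Return True if offloaded weights fit in host RAM with a safety margin.

    Hypothetical helper, not a vLLM-Omni API. `headroom` is the fraction of
    available RAM kept free for the rest of the system (an assumed default).
    """
    if weight_bytes < 0 or available_host_bytes < 0:
        raise ValueError("sizes must be non-negative")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = available_host_bytes * (1.0 - headroom)
    return weight_bytes <= usable


GIB = 1024**3
# 14 GiB of weights against 64 GiB of free host RAM: offloading fits.
print(fits_in_host_ram(14 * GIB, 64 * GIB))
# 14 GiB of weights against 15 GiB of free host RAM: too tight with 10% headroom.
print(fits_in_host_ram(14 * GIB, 15 * GIB))
```

If the check fails, Level 1 offloading is likely to put the host under memory pressure, so it is worth freeing RAM first or evaluating Level 2 instead.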

2. Usage Examples¶

Python API Example¶

You can programmatically control the lifecycle of stages using the AsyncOmni engine.

import asyncio
from vllm_omni.entrypoints.async_omni import AsyncOmni

async def run_sleep_demo():
    # 1. Initialize the engine with sleep mode enabled.
    engine = AsyncOmni(
        model="ByteDance-Seed/BAGEL-7B-MoT",
        enable_sleep_mode=True,
    )

    # 2. Put Stage 0 into Level 2 sleep. sleep() returns a list of OmniACK objects.
    acks = await engine.sleep(stage_ids=[0], level=2)
    print(f"Freed {acks[0].freed_bytes / 1024**3:.2f} GiB on Stage 0")

    # 3. Wake Stage 0 up again.
    await engine.wake_up(stage_ids=[0])

if __name__ == "__main__":
    asyncio.run(run_sleep_demo())
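
In workflows such as RLHF, where a stage sleeps only while another task borrows the GPU, it can help to guarantee that the stage wakes up even if that task fails. The sketch below wraps the `sleep` and `wake_up` calls in an async context manager. `stages_asleep` is a hypothetical helper, and `FakeEngine` is a stand-in used only so the example runs without a GPU; neither is part of vLLM-Omni.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def stages_asleep(engine, stage_ids, level=1):
    """Sleep the given stages, then wake them on exit, even after an error."""
    acks = await engine.sleep(stage_ids=stage_ids, level=level)
    try:
        yield acks
    finally:
        await engine.wake_up(stage_ids=stage_ids)


class FakeEngine:
    """Stand-in for AsyncOmni that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    async def sleep(self, stage_ids, level):
        self.calls.append(("sleep", tuple(stage_ids), level))
        return []

    async def wake_up(self, stage_ids):
        self.calls.append(("wake_up", tuple(stage_ids)))
        return []


async def demo():
    engine = FakeEngine()
    async with stages_asleep(engine, [0], level=2):
        pass  # e.g. run a training step on the freed GPU memory
    return engine.calls


calls = asyncio.run(demo())
print(calls)
```

With a real engine, pass the `AsyncOmni` instance in place of `FakeEngine()`.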

Server Command Example¶

Start the server with sleep mode enabled:

Method 1: the vllm serve command

vllm serve ByteDance-Seed/BAGEL-7B-MoT \
    --omni \
    --enable-sleep-mode \
    --trust-remote-code \
    --gpu-memory-utilization 0.7

Method 2: the API server module

python3 -m vllm_omni.entrypoints.openai.api_server \
    --model ByteDance-Seed/BAGEL-7B-MoT \
    --omni \
    --enable-sleep-mode \
    --trust-remote-code \
    --gpu-memory-utilization 0.7

Test Scenarios & Commands¶

Scenario 1: LLM Engine Sleep¶

Objective: Verify VRAM reclamation for Stage 0 (Thinker).

Trigger sleep (Level 1 or Level 2) via client:

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0], "level": 2}'

Tip: Open a new terminal and run rocm-smi or nvidia-smi to observe the immediate drop in VRAM usage.
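
To record the drop instead of watching it, the used memory can be sampled before and after the sleep request. The sketch below covers NVIDIA only and relies on the `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` query, which prints one MiB value per GPU. The helper names are hypothetical and the sample numbers are made up for illustration. rocm-smi uses different flags and output and is not covered here.

```python
import subprocess


def parse_used_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB used) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_used_mib() -> list[int]:
    """Run nvidia-smi and return used memory in MiB for each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_used_mib(out)


# Sample text in the shape nvidia-smi prints for two GPUs (values are made up).
before = parse_used_mib("61234\n60980\n")
after = parse_used_mib("2110\n2098\n")
drop = [b - a for b, a in zip(before, after)]
print(drop)
```

On a machine with NVIDIA GPUs, call `query_used_mib()` once before and once after the sleep request in place of the sample strings.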

Scenario 2: Diffusion Sleep¶

Objective: Verify VRAM reclamation for Stage 1 (Diffusion).

Trigger sleep (Level 1 or Level 2) via client:

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [1], "level": 2}'

Scenario 3: Multi-Stage Coordinated Stress Test¶

Objective: Test concurrent sleep and rapid wake-up across multiple stages.

Concurrent Sleep (Stage 0 & 1):

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0, 1], "level": 2}'

Rapid Wake-up:

curl -X POST http://localhost:8000/v1/omni/wakeup \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0, 1]}'
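
The same sleep and wake-up requests can be issued from Python with the standard library. This is a minimal sketch that mirrors the curl commands above. `build_request` and `post_json` are hypothetical helpers, the 60-second timeout is an assumed default, and the response body is returned as raw text because its exact shape is not specified here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the /v1/omni endpoints."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the requests does not contact the server; post_json() does.
sleep_req = build_request("/v1/omni/sleep", {"stage_ids": [0, 1], "level": 2})
wake_req = build_request("/v1/omni/wakeup", {"stage_ids": [0, 1]})
print(sleep_req.get_method(), sleep_req.full_url)
print(wake_req.get_method(), wake_req.full_url)
```

With the server running, `post_json("/v1/omni/sleep", {"stage_ids": [0, 1], "level": 2})` followed by `post_json("/v1/omni/wakeup", {"stage_ids": [0, 1]})` reproduces the stress test.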

Scenario 4: Full Lifecycle Memory Audit & Functional Integrity¶

Objective: Audit the complete flow from sleep to wake-up, followed by an inference check.

Check Initial State: Observe baseline VRAM usage.

Trigger Deep Sleep (Level 2):

curl -X POST http://localhost:8000/v1/omni/sleep \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0], "level": 2}'

Wake-up Model:

curl -X POST http://localhost:8000/v1/omni/wakeup \
     -H "Content-Type: application/json" \
     -d '{"stage_ids": [0]}'

Verify Functional Integrity (Inference): Ensure the model still generates valid output after reloading weights.

curl -X POST http://localhost:8000/v1/images/generations \
     -H "Content-Type: application/json" \
     -d '{
       "prompt": "A huge swimming pool, with many people swimming.",
       "model": "ByteDance-Seed/BAGEL-7B-MoT",
       "response_format": "b64_json",
       "extra_body": {"sampling_params": {"num_inference_steps": 4, "seed": 42}}
     }' > post.json
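
A quick way to confirm that the saved response contains a real image is to decode it and check the file signature. The sketch below assumes the response follows the OpenAI-style layout `{"data": [{"b64_json": "..."}]}`, which this page does not confirm, so adjust the keys if your server returns a different shape. The helper names are hypothetical.

```python
import base64
import json

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def first_image_bytes(response_text: str) -> bytes:
    """Decode the first image, assuming a {"data": [{"b64_json": ...}]} body."""
    body = json.loads(response_text)
    return base64.b64decode(body["data"][0]["b64_json"])


def looks_like_image(raw: bytes) -> bool:
    """Cheap sanity check on the file signature (PNG or JPEG only)."""
    return raw.startswith(PNG_MAGIC) or raw.startswith(JPEG_MAGIC)


# Synthetic response so the sketch runs offline.
fake = json.dumps(
    {"data": [{"b64_json": base64.b64encode(PNG_MAGIC + b"...").decode()}]}
)
print(looks_like_image(first_image_bytes(fake)))
```

For the real check, replace `fake` with the contents of post.json, for example `open("post.json").read()`.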

3. API Reference¶

Methods¶

Method	Arguments	Return Type	Description
sleep	`stage_ids: List[int], level: int`	`List[OmniACK]`	Triggers hibernation for specified stages.
wake_up	`stage_ids: List[int]`	`List[OmniACK]`	Reloads weights and re-maps memory.

OmniACK Dataclass Fields¶

Field	Type	Description
task_id	`str`	Unique identifier for the operation.
status	`str`	`SUCCESS` or `ERROR`.
stage_id	`int`	The ID of the stage that responded.
rank	`int`	The rank ID within the Tensor Parallel group.
freed_bytes	`int`	Actual amount of physical VRAM reclaimed.
metadata	`dict`	Additional platform-specific metrics.
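
The fields above are enough to check that every worker succeeded and to total the reclaimed memory per stage. In the sketch below, `OmniACK` is a local stand-in that mirrors the documented fields rather than the real class, `freed_gib_by_stage` is a hypothetical helper, and the sample values are made up. It assumes each Tensor Parallel rank reports its own ACK.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class OmniACK:
    """Local stand-in that mirrors the documented fields; not the real class."""
    task_id: str
    status: str
    stage_id: int
    rank: int
    freed_bytes: int
    metadata: dict = field(default_factory=dict)


def freed_gib_by_stage(acks):
    """Sum freed_bytes per stage across ranks; raise if any ACK is not SUCCESS."""
    failed = [a for a in acks if a.status != "SUCCESS"]
    if failed:
        raise RuntimeError(f"{len(failed)} ACK(s) reported an error")
    totals = defaultdict(int)
    for ack in acks:
        totals[ack.stage_id] += ack.freed_bytes
    return {stage: total / 1024**3 for stage, total in totals.items()}


acks = [
    OmniACK("t1", "SUCCESS", stage_id=0, rank=0, freed_bytes=4 * 1024**3),
    OmniACK("t1", "SUCCESS", stage_id=0, rank=1, freed_bytes=4 * 1024**3),
    OmniACK("t1", "SUCCESS", stage_id=1, rank=0, freed_bytes=2 * 1024**3),
]
print(freed_gib_by_stage(acks))
```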

Metadata Field Analysis

The metadata field is a dynamic dictionary containing hardware-specific telemetry and audit data, primarily used for verifying memory reclamation on various backends (e.g., AMD ROCm, NVIDIA CUDA).

"metadata": {
    "source": "Platform_AMD_Instinct_MI300X",
    "total_freed_gib": "78.57",
    "rank_residual_gib": "2.07"
}

Core Utility¶

VRAM Reclamation Audit (total_freed_gib): Converts raw freed_bytes into human-readable GiB. It serves as the primary metric to verify that Level 2 sleep has successfully purged model weights from VRAM.

Residual & Fragmentation Monitoring (rank_residual_gib): Reports the remaining VRAM footprint after memory de-mapping. A low residual value (e.g., 2.07 GiB) confirms a successful "clean" state, ensuring the device is ready for high-memory co-located tasks like training or diffusion pipelines.

Backend Traceability (source): Identifies the underlying hardware driver or audit source. This is critical for debugging synchronization issues in multi-stage, distributed environments.

Performance Analytics (Roadmap): Future updates will include latency_ms (context-switch overhead) and cuda_graph_recalled (graph engine status) to optimize performance in high-frequency sleep/wake scenarios.
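
Because the sample metadata stores its metrics as strings, an audit script needs to convert them before comparing. The sketch below shows one way to do that for the three fields in the example above. `audit_metadata` is a hypothetical helper, and the 4.0 GiB residual threshold is an assumed value, not a documented limit.

```python
def audit_metadata(metadata: dict, max_residual_gib: float = 4.0) -> dict:
    """Convert the string metrics to floats and flag a high residual footprint.

    `max_residual_gib` is an assumed threshold; pick one that suits your device.
    """
    freed = float(metadata["total_freed_gib"])
    residual = float(metadata["rank_residual_gib"])
    return {
        "source": metadata.get("source", "unknown"),
        "freed_gib": freed,
        "residual_gib": residual,
        "clean": residual <= max_residual_gib,
    }


sample = {
    "source": "Platform_AMD_Instinct_MI300X",
    "total_freed_gib": "78.57",
    "rank_residual_gib": "2.07",
}
report = audit_metadata(sample)
print(report)
```

Since metadata is platform-specific, other backends may omit these keys; in that case the helper raises `KeyError`.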