Architecture Overview
This document summarizes the vLLM Gaudi additions used to support single-process model swap. The baseline V1 architecture is described in the upstream vLLM architecture documentation; only the Gaudi-specific delta is covered here.
Entrypoints
vLLM provides multiple entrypoints for interacting with the system. For online inference, the model-swap feature uses a dedicated OpenAI-compatible Gaudi server entrypoint.
OpenAI-Compatible Gaudi API Server
The server can be launched directly via:

    export VLLM_SERVER_DEV_MODE=1
    export VLLM_ALLOW_INSECURE_SERIALIZATION=1
    export VLLM_HPU_MULTI_MODEL_CONFIG=/path/to/multi_models.yaml
    python -m vllm_gaudi.entrypoints.openai.multi_model_api_server

Implementation lives in vllm_gaudi/entrypoints/openai/multi_model_api_server.py.
New / Modified Components
| Component | Type | Role in Model Swap |
|---|---|---|
| `vllm_gaudi.v1.engine.MultiModelAsyncLLM` | New manager wrapper | Owns multi-model configs, serializes swap requests, drains in-flight requests, and triggers in-process reconfigure |
| `MultiModelEngineClient` | Engine client adapter | Exposes the wrapped AsyncLLM through the standard server-facing EngineClient interface |
| `MultiModelServingModels` | OpenAI model registry adapter | Lists all configured models in `/v1/models`, while keeping request validation aligned with the currently active model |
| `install_engine_core_patch()` | Runtime patch installer | Injects `gaudi_reconfigure_engine()` into V1 EngineCore when MultiModelAsyncLLM is constructed |
| `EngineCore.gaudi_reconfigure_engine()` | Added utility method (patched) | Performs in-place runtime rebuild: pause/sleep, worker reload, KV cache re-init, scheduler/state reconstruction, resume |
| `HPUWorker.unload_model()` / `HPUWorker.load_model()` | Extended worker load/unload path | Stashes and restores model runner across switches, skipping warmup for repeat loads |
| `MultiModelAsyncLLM._refresh_engine_frontend_config()` | Frontend refresh step | Rebuilds frontend-side renderer and processors so request parsing/tokenization matches the newly active model |
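The runtime patch installer row describes a method being injected into an existing class. A minimal sketch of that general technique follows. `ToyEngineCore`, `install_patch`, and the body of the injected method are hypothetical stand-ins for illustration only; the real `install_engine_core_patch()` may be structured differently.

```python
class ToyEngineCore:
    """Stand-in for the V1 EngineCore; not the real class."""

    def __init__(self) -> None:
        self.active_model = "model-a"


def _gaudi_reconfigure_engine(self, new_model: str) -> str:
    """Hypothetical injected method: swap the active model, return the old one."""
    previous = self.active_model
    self.active_model = new_model
    return previous


def install_patch(cls: type) -> bool:
    """Attach the method once; return True only if this call installed it."""
    if hasattr(cls, "gaudi_reconfigure_engine"):
        return False  # already patched, keep the first installation
    setattr(cls, "gaudi_reconfigure_engine", _gaudi_reconfigure_engine)
    return True


first = install_patch(ToyEngineCore)
second = install_patch(ToyEngineCore)  # repeated installs are harmless

core = ToyEngineCore()
previous = core.gaudi_reconfigure_engine("model-b")
```

Making the installer idempotent means that constructing the manager more than once in the same process does not stack patches on top of each other.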
Control Plane Delta: Switch Flow
Caller side (`MultiModelAsyncLLM.switch_model`)
- Acquire `_switching_lock` (single swap at a time).
- Validate target model and skip no-op switches.
- Drain pending requests (`wait_for_requests_to_drain`).
- Serialize target `VllmConfig` with `cloudpickle`.
- Invoke EngineCore utility: `call_utility_async("gaudi_reconfigure_engine", serialized_config)`.
- Refresh frontend-side `AsyncLLM` state (renderer, I/O processor, input processor, output processor).
- Update local model sleep-state bookkeeping and active model pointer.
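The caller-side steps above can be sketched as a small asyncio program. This is an illustration under stated assumptions, not the actual `MultiModelAsyncLLM` code: `ToySwapManager`, its `in_flight` counter, and the dictionary configs are hypothetical, and the standard-library `pickle` stands in for `cloudpickle` so the sketch has no third-party dependencies.

```python
import asyncio
import pickle  # stand-in for cloudpickle in this sketch


class ToySwapManager:
    """Toy model of the caller-side switch flow; not the real manager."""

    def __init__(self, configs: dict[str, dict], active: str) -> None:
        self.configs = configs
        self.active = active
        self.in_flight = 0  # hypothetical count of pending requests
        self.events: list[str] = []
        self._switching_lock = asyncio.Lock()

    async def _wait_for_requests_to_drain(self) -> None:
        while self.in_flight > 0:
            await asyncio.sleep(0.001)
        self.events.append("drained")

    async def _call_engine_core(self, payload: bytes) -> None:
        # Hypothetical stand-in for the EngineCore utility call.
        config = pickle.loads(payload)
        self.events.append(f"reconfigured:{config['model']}")

    async def switch_model(self, name: str) -> bool:
        async with self._switching_lock:  # one swap at a time
            if name not in self.configs:
                raise KeyError(f"unknown model: {name}")
            if name == self.active:
                return False  # no-op switch
            await self._wait_for_requests_to_drain()
            payload = pickle.dumps(self.configs[name])
            await self._call_engine_core(payload)
            self.events.append("frontend-refreshed")
            self.active = name  # update the active model pointer last
            return True


async def demo() -> ToySwapManager:
    manager = ToySwapManager(
        {"model-a": {"model": "model-a"}, "model-b": {"model": "model-b"}},
        active="model-a",
    )
    manager.in_flight = 1  # pretend one request is still running

    async def finish_request() -> None:
        await asyncio.sleep(0.01)
        manager.in_flight = 0

    finisher = asyncio.create_task(finish_request())
    await manager.switch_model("model-b")
    await manager.switch_model("model-b")  # second call is a no-op
    await finisher
    return manager


manager = asyncio.run(demo())
```

Updating the active pointer only after the engine call and the frontend refresh keeps the bookkeeping consistent if an earlier step raises.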
EngineCore side (`gaudi_reconfigure_engine`)
- Deserialize new config.
- Pause scheduler with cache reset (`pause_scheduler(mode="abort", clear_cache=True)`).
- Sleep executor at level 1 to release device memory pressure.
- Unload current worker model via collective RPC (`unload_model`).
- Broadcast worker reload via collective RPC (`load_model`).
- Recompute and initialize KV cache (`_initialize_kv_caches`, `initialize_cache`).
- Rebuild scheduler-dependent runtime objects:
  - `StructuredOutputManager`
  - scheduler instance
  - KV connector handshake metadata
  - multimodal receiver cache
  - request block hasher and batch queue helpers
- Reset executor sleep bookkeeping and resume scheduler.
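The ordering of the EngineCore-side steps can be illustrated with stub objects that only record what was called. `ToyExecutor` and `ToyEngineCore` below are hypothetical and share nothing with the real classes beyond the step names taken from the list above; `pickle` again stands in for `cloudpickle`.

```python
import pickle  # stand-in for cloudpickle in this sketch


class ToyExecutor:
    """Records calls instead of talking to real workers."""

    def __init__(self, log: list[str]) -> None:
        self.log = log

    def sleep(self, level: int) -> None:
        self.log.append(f"sleep:{level}")

    def collective_rpc(self, method: str) -> None:
        self.log.append(f"rpc:{method}")


class ToyEngineCore:
    def __init__(self, model: str) -> None:
        self.log: list[str] = []
        self.executor = ToyExecutor(self.log)
        self.model = model
        self.scheduler = object()  # placeholder for a scheduler instance
        self.paused = False

    def gaudi_reconfigure_engine(self, serialized_config: bytes) -> None:
        config = pickle.loads(serialized_config)  # 1. deserialize
        self.paused = True                        # 2. pause scheduling
        self.log.append("pause")
        self.executor.sleep(level=1)              # 3. release device memory
        self.executor.collective_rpc("unload_model")  # 4. unload old model
        self.executor.collective_rpc("load_model")    # 5. load new model
        self.log.append("init_kv_cache")          # 6. recompute KV cache
        self.scheduler = object()                 # 7. rebuild, never reuse
        self.log.append("rebuild_scheduler")
        self.model = config["model"]
        self.paused = False                       # 8. resume
        self.log.append("resume")


core = ToyEngineCore("model-a")
old_scheduler = core.scheduler
core.gaudi_reconfigure_engine(pickle.dumps({"model": "model-b"}))
```

The sketch replaces the scheduler object outright rather than mutating it, which mirrors the stated goal of not carrying state across model boundaries.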
State Rebuild Delta
The model swap path rebuilds runtime state that is model-shape or scheduler-policy dependent. This avoids carrying stale state across model boundaries (for example incompatible layer bindings or stale block-table assumptions).
Rebuilt state includes:
- KV cache configuration and block counts
- scheduler instance and block sizing
- structured output manager
- multimodal receiver cache
- request block hashing setup
- queueing/execution helper state (`batch_queue`, `step_fn`, abort queue)
- frontend-side renderer and request processors used by the OpenAI server
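To see why block counts cannot be carried across a swap, consider a back-of-the-envelope calculation for a standard attention KV cache. The model shapes, block size, and memory budget below are hypothetical round numbers chosen for illustration, and the formula is the usual per-block size estimate rather than the exact accounting vLLM performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModelShape:
    """Hypothetical model dimensions that determine KV cache size."""

    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int


def bytes_per_block(shape: ToyModelShape, block_size: int) -> int:
    # The factor 2 covers the separate key and value tensors.
    return (
        2
        * shape.num_layers
        * shape.num_kv_heads
        * shape.head_dim
        * shape.dtype_bytes
        * block_size
    )


def num_blocks(shape: ToyModelShape, block_size: int, budget_bytes: int) -> int:
    # Whole blocks that fit in the memory reserved for the KV cache.
    return budget_bytes // bytes_per_block(shape, block_size)


small = ToyModelShape(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
large = ToyModelShape(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)
budget = 16 * 1024**3  # 16 GiB, an assumed KV cache budget

small_blocks = num_blocks(small, block_size=128, budget_bytes=budget)
large_blocks = num_blocks(large, block_size=128, budget_bytes=budget)
print(small_blocks, large_blocks)  # 1024 409
```

With the same budget, the deeper model gets fewer than half as many blocks, so a scheduler sized for the first model would hand out blocks that do not exist for the second.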
API Behavior Notes
- `/v1/models` returns all configured model aliases from the multi-model YAML file.
- Inference requests are still served by the currently active model only.
- `/v1/models/switch` is exposed only when `VLLM_SERVER_DEV_MODE=1`.
- `VLLM_ALLOW_INSECURE_SERIALIZATION=1` is required because the current in-process reconfigure path uses `cloudpickle` for internal config transfer. Enable this only for trusted/internal deployments.
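The split between listing and serving can be sketched with a toy registry. `ToyModelRegistry` and `switch_route_enabled` are hypothetical names, the response shape only approximates an OpenAI-style model list, and the sketch assumes that a request naming a non-active model is rejected, which the notes above do not spell out.

```python
class ToyModelRegistry:
    """Lists every configured alias but serves only the active one."""

    def __init__(self, aliases: list[str], active: str) -> None:
        if active not in aliases:
            raise ValueError(f"active model {active!r} is not configured")
        self.aliases = list(aliases)
        self.active = active

    def list_models(self) -> dict:
        # Every configured alias is visible, active or not.
        return {
            "object": "list",
            "data": [{"id": alias, "object": "model"} for alias in self.aliases],
        }

    def can_serve(self, requested: str) -> bool:
        # Assumption: only the currently active model passes validation.
        return requested == self.active


def switch_route_enabled(env: dict[str, str]) -> bool:
    # Assumption: the route is gated on the variable being exactly "1".
    return env.get("VLLM_SERVER_DEV_MODE", "0") == "1"


registry = ToyModelRegistry(["model-a", "model-b"], active="model-a")
listed = [entry["id"] for entry in registry.list_models()["data"]]
```

A client can therefore discover `model-b` through the listing while it is inactive, but would need to switch to it before sending inference requests.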