Architecture Overview¶

This document summarizes the vLLM Gaudi additions used to support single-process model swap. The baseline V1 architecture is described in the upstream vLLM architecture documentation; only the Gaudi-specific delta is covered here.

Entrypoints¶

vLLM provides multiple entrypoints for interacting with the system. For online inference, the model-swap feature uses a dedicated OpenAI-compatible Gaudi server entrypoint.

OpenAI-Compatible Gaudi API Server¶

The server can be launched directly via:

export VLLM_SERVER_DEV_MODE=1
export VLLM_ALLOW_INSECURE_SERIALIZATION=1
export VLLM_HPU_MULTI_MODEL_CONFIG=/path/to/multi_models.yaml
python -m vllm_gaudi.entrypoints.openai.multi_model_api_server

Implementation lives in vllm_gaudi/entrypoints/openai/multi_model_api_server.py.

New / Modified Components¶

Component	Type	Role in Model Swap
`vllm_gaudi.v1.engine.MultiModelAsyncLLM`	New manager wrapper	Owns multi-model configs, serializes swap requests, drains in-flight requests, and triggers in-process reconfigure
`MultiModelEngineClient`	Engine client adapter	Exposes the wrapped `AsyncLLM` through the standard server-facing `EngineClient` interface
`MultiModelServingModels`	OpenAI model registry adapter	Lists all configured models in `/v1/models`, while keeping request validation aligned with the currently active model
`install_engine_core_patch()`	Runtime patch installer	Injects `gaudi_reconfigure_engine()` into V1 `EngineCore` when `MultiModelAsyncLLM` is constructed
`EngineCore.gaudi_reconfigure_engine()`	Added utility method (patched)	Performs in-place runtime rebuild: pause/sleep, worker reload, KV cache re-init, scheduler/state reconstruction, resume
`HPUWorker.unload_model()` / `HPUWorker.load_model()`	Extended worker load/unload path	Stashes and restores model runner across switches, skipping warmup for repeat loads
`MultiModelAsyncLLM._refresh_engine_frontend_config()`	Frontend refresh step	Rebuilds frontend-side renderer and processors so request parsing/tokenization matches the newly active model

Control Plane Delta: Switch Flow¶

Caller side (`MultiModelAsyncLLM.switch_model`)¶

Acquire _switching_lock (single swap at a time).
Validate target model and skip no-op switches.
Drain pending requests (wait_for_requests_to_drain).
Serialize target VllmConfig with cloudpickle.
Invoke EngineCore utility: call_utility_async("gaudi_reconfigure_engine", serialized_config).
Refresh frontend-side AsyncLLM state (renderer, I/O processor, input processor, output processor).
Update local model sleep-state bookkeeping and active model pointer.

EngineCore side (`gaudi_reconfigure_engine`)¶

Deserialize new config.
Pause scheduler with cache reset (pause_scheduler(mode="abort", clear_cache=True)).
Sleep executor at level 1 to release device memory pressure.
Unload current worker model via collective RPC (unload_model).
Broadcast worker reload via collective RPC (load_model).
Recompute and initialize KV cache (_initialize_kv_caches, initialize_cache).
Rebuild scheduler-dependent runtime objects: - StructuredOutputManager - scheduler instance - KV connector handshake metadata - multimodal receiver cache - request block hasher and batch queue helpers
Reset executor sleep bookkeeping and resume scheduler.

State Rebuild Delta¶

The model swap path rebuilds runtime state that is model-shape or scheduler-policy dependent. This avoids carrying stale state across model boundaries (for example incompatible layer bindings or stale block-table assumptions).

Rebuilt state includes:

KV cache configuration and block counts
scheduler instance and block sizing
structured output manager
multimodal receiver cache
request block hashing setup
queueing/execution helper state (batch_queue, step_fn, abort queue)
frontend-side renderer and request processors used by the OpenAI server

API Behavior Notes¶

/v1/models returns all configured model aliases from the multi-model YAML file.
Inference requests are still served by the currently active model only.
/v1/models/switch is exposed only when VLLM_SERVER_DEV_MODE=1.
VLLM_ALLOW_INSECURE_SERIALIZATION=1 is required because the current in-process reconfigure path uses cloudpickle for internal config transfer. Enable this only for trusted/internal deployments.