Single-Process Model Swap (Online Quickstart)¶
This quickstart shows an end-to-end online flow for serving multiple small models sequentially in one Gaudi server process, without restarting the API server between model changes.
When to Use¶
Use this mode when:
- You need to switch model A → model B without server restart.
- Your workload is sequential and each model fits the available device budget when loaded.
Do not use this mode as a replacement for multi-process or multi-node orchestration.
Prerequisites¶
- vLLM and the vLLM Gaudi plugin are installed.
- A multi-model config file is available, for example:
default_model: llama
models:
  llama:
    model: meta-llama/Llama-3.1-8B-Instruct
    tensor_parallel_size: 1
    max_model_len: 4096
    enable_auto_tool_choice: false
  qwen:
    model: Qwen/Qwen3-0.6B
    tensor_parallel_size: 1
    max_model_len: 4096
    enable_auto_tool_choice: true
    tool_call_parser: hermes
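Before starting the server, it can help to confirm that default_model names one of the aliases defined under models. The sketch below is a plain text check, not a YAML parser. It assumes the aliases are indented keys as in a normally formatted config, and that the default must match one of them (an assumption, not something the plugin is confirmed to enforce). The function name check_default_alias is hypothetical.

```shell
# Hypothetical helper: verify that default_model refers to an alias in the file.
# Text-based check only; assumes aliases appear as indented "name:" keys.
check_default_alias() {
  cfg="$1"
  [ -r "$cfg" ] || { echo "cannot read config: $cfg" >&2; return 1; }

  # Extract the value of the top-level default_model key.
  default=$(sed -n 's/^default_model:[[:space:]]*//p' "$cfg" | head -n 1)
  [ -n "$default" ] || { echo "no default_model key in $cfg" >&2; return 1; }

  # Look for an indented key with the same name.
  if grep -Eq "^[[:space:]]+${default}:[[:space:]]*$" "$cfg"; then
    echo "default_model '$default' is defined under models"
  else
    echo "default_model '$default' has no matching alias" >&2
    return 1
  fi
}
```

For example, run check_default_alias /path/to/multi_models.yaml before exporting VLLM_HPU_MULTI_MODEL_CONFIG.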
Start Server¶
export VLLM_SERVER_DEV_MODE=1
export VLLM_ALLOW_INSECURE_SERIALIZATION=1
export VLLM_HPU_MULTI_MODEL_CONFIG=/path/to/multi_models.yaml
python -m vllm_gaudi.entrypoints.openai.multi_model_api_server \
--host 0.0.0.0 \
--port 8080
Notes:
- This entrypoint reads the configured model aliases from VLLM_HPU_MULTI_MODEL_CONFIG.
- /v1/models lists every configured alias, but generation requests are handled by the currently active model only.
- /v1/models/switch is available only when VLLM_SERVER_DEV_MODE=1.
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 is currently required because the in-process reconfigure hook uses cloudpickle internally. Use this mode only in trusted/internal deployments.
- The frontend settings enable_auto_tool_choice, tool_call_parser, and chat_template can be set per model in the YAML config.
- Per-model chat_template values can be absolute paths or paths relative to the multi-model config file.
- A per-model quant_config path can be specified to set the QUANT_CONFIG environment variable.
- If a frontend setting is absent for a model, the server falls back to the corresponding CLI value.
- If quant_config is absent for a model, the existing QUANT_CONFIG environment value is preserved.
- Set quant_config: null for a model to explicitly clear QUANT_CONFIG when that model is active.
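The three quant_config cases in the notes above (key absent, explicit null, explicit path) can be pictured with a small shell sketch. It illustrates the documented behavior only and is not the plugin's implementation. The function name apply_quant_config and its mode argument are hypothetical, and it assumes that "clear" means the variable is unset.

```shell
# Hypothetical helper mirroring the documented QUANT_CONFIG rules.
#   apply_quant_config absent        -> key missing in the model entry
#   apply_quant_config null          -> quant_config: null
#   apply_quant_config path <file>   -> quant_config: <file>
apply_quant_config() {
  case "$1" in
    absent) : ;;                          # keep the existing QUANT_CONFIG value
    null)   unset QUANT_CONFIG ;;         # assumption: clearing means unsetting
    path)   export QUANT_CONFIG="$2" ;;   # point QUANT_CONFIG at the given file
    *)      echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```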
Online Flow (Smoke Test)¶
- List available models:
curl -s http://localhost:8080/v1/models | jq
- Generate with default model:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Explain Intel Gaudi in one sentence."}],
"max_tokens": 64,
"temperature": 0
}' | jq
The model field should match the active model. After a successful switch, use the new model alias in subsequent requests.
- Switch model in-process:
curl -s http://localhost:8080/v1/models/switch \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"drain_timeout": 60
}' | jq
- Generate with switched model:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [{"role": "user", "content": "Explain Intel Gaudi in one sentence."}],
"max_tokens": 64,
"temperature": 0
}' | jq
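The switch and generate steps above can be combined into one helper for repeated smoke tests. This is a minimal sketch under stated assumptions: the function names chat_payload and switch_and_chat and the BASE_URL variable are hypothetical, the request bodies mirror the examples in this section, and the prompt must not contain double quotes or backslashes because the JSON is assembled with printf. The response of /v1/models/switch is not parsed, since its shape is not described here.

```shell
# Base URL of the running server; override by exporting BASE_URL.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Build a chat completion body. $1 = model alias, $2 = prompt (no quotes or backslashes).
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 64, "temperature": 0}' "$1" "$2"
}

# Switch to alias $1, then send prompt $2 to it. Stops if the switch request fails.
switch_and_chat() {
  curl -sf "$BASE_URL/v1/models/switch" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$1\", \"drain_timeout\": 60}" || return 1
  curl -sf "$BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}
```

For example, switch_and_chat qwen "Explain Intel Gaudi in one sentence." performs the last two steps of the flow in one call. The -f flag makes curl return a non-zero status on HTTP errors, so a failed switch is not followed by a generation request.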
Rollback¶
To disable this mode, unset the multi-model environment variable (VLLM_HPU_MULTI_MODEL_CONFIG) and start the server with the standard serving entrypoint.
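A minimal sketch of the environment cleanup, assuming VLLM_HPU_MULTI_MODEL_CONFIG is the multi-model flag referred to above and that the other two variables exported in Start Server were set only for this mode:

```shell
# Remove the multi-model configuration from the current shell.
unset VLLM_HPU_MULTI_MODEL_CONFIG

# Assumption: these were enabled only for in-process model switching.
unset VLLM_SERVER_DEV_MODE VLLM_ALLOW_INSECURE_SERIALIZATION
```

With the variables cleared, start the server through the usual entrypoint, for example vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8080. The model name is only an example taken from the config in Prerequisites.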