Advanced Configuration Options¶
To align the setup to your specific needs, you can use optional advanced configurations for running the vLLM server and benchmark. These configurations let you fine-tune performance, memory usage, and request handling using additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.
Running vLLM Using Docker Compose with Custom Parameters¶
This configuration allows you to override the default settings by providing additional environment variables when starting the server. This allows fine-tuning for performance and memory usage.
The following table lists the available variables:
| Variable | Description |
|---|---|
PT_HPU_LAZY_MODE |
Enables Lazy execution mode, potentially improving performance by batching operations. |
| Lazy mode is not the default. Lazy mode is compatible only with Intel Gaudi PyTorch fork. | |
VLLM_SKIP_WARMUP |
Skips the model warm-up phase to reduce startup time. It may affect initial latency. |
MAX_MODEL_LEN |
Sets the maximum supported sequence length for the model. |
MAX_NUM_SEQS |
Specifies the maximum number of sequences processed concurrently. |
TENSOR_PARALLEL_SIZE |
Defines the degree of tensor parallelism. |
VLLM_BUCKETING_STRATEGY |
Selects the bucketing strategy used for warm-up: exp, lin, or pad. |
VLLM_DECODE_BLOCK_BUCKET_STEP |
Configures the step size for decode block allocation, affecting memory granularity. |
VLLM_DECODE_BS_BUCKET_STEP |
Sets the batch size step for decode operations, impacting how decode batches are grouped. |
VLLM_PROMPT_BS_BUCKET_STEP |
Adjusts the batch size step for prompt processing. |
VLLM_PROMPT_QUERY_BUCKET_STEP |
Controls the step size for prompt query-length allocation. |
EXTRA_ARGS |
Additional vLLM serve args for the server bringup e.g., " --served-model-name model_name" |
Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
docker compose up
Running vLLM and Benchmark with Custom Parameters¶
This configuration allows you to customize benchmark behavior by setting additional environment variables before running Docker Compose.
The following table lists the available variables:
| Variable | Description |
|---|---|
INPUT_TOK |
Number of input tokens per prompt. |
OUTPUT_TOK |
Number of output tokens to generate per prompt. |
CON_REQ |
Number of concurrent requests during benchmarking. |
NUM_PROMPTS |
Total number of prompts to use in the benchmark. |
EXTRA_BENCH_ARGS |
Additional vLLM bench args e.g., " --tokenizer-mode hf" |
Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up
This launches the vLLM server and runs the benchmark using your specified parameters.
Running vLLM and Benchmark with Combined Custom Parameters¶
This configuration allows you to launch the vLLM server and benchmark together. You can set any combination of server and benchmark-specific variables mentioned earlier. Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up
This command starts the server and executes benchmarking with the provided configuration.
Running vLLM and Benchmark Using Configuration Files¶
This configuration allows you to configure the server and benchmark via YAML configuration files.
The following table lists the available environment variables:
| Variable | Description |
|---|---|
VLLM_SERVER_CONFIG_FILE |
Path to the server config file inside the Docker container. |
VLLM_SERVER_CONFIG_NAME |
Name of the server config section. |
VLLM_BENCHMARK_CONFIG_FILE |
Path to the benchmark config file inside the container. |
VLLM_BENCHMARK_CONFIG_NAME |
Name of the benchmark config section. |
Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server/server_scenarios_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark/benchmark_scenarios_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
Note
When using configuration files, you do not need to set the MODEL variable as the model details are included in the config files. However, the HF_TOKEN flag is still required.
Running vLLM Directly Using Docker¶
For maximum control, you can run the server directly using the docker run command, allowing full customization of Docker runtime settings, as in the following example:
docker run -it --rm \
-e MODEL=$MODEL \
-e HF_TOKEN=$HF_TOKEN \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e no_proxy=$no_proxy \
--cap-add=sys_nice \
--ipc=host \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-p 8000:8000 \
--name vllm-server \
<docker image name>
This method provides full flexibility over how the vLLM server is executed within the container.
Dry Run to create vLLM sever and client command line¶
Set environment variable DRY_RUN=1
DRY_RUN env var set to 1 create a copy of vllm-server.sh or vllm-benchmark.sh command line file on the host machine, without launching the server or the client.
Example - Docker Compose
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
DRY_RUN=1 \
docker compose up
Example - Docker Run
docker run -it --rm \
-e MODEL=$MODEL \
-e HF_TOKEN=$HF_TOKEN \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e no_proxy=$no_proxy \
--cap-add=sys_nice \
--ipc=host \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-p 8000:8000 \
-e DRY_RUN=1 \
-v /tmp:/local \
--name vllm-server \
<docker image name>
Note
While launching the vLLM server using Docker Run command for Dry Run, make sure to mount /tmp directory as -v /tmp:/local. If user has write access to NFS, mount -v ${PWD}:/local instead of -v /tmp:/local.
The command line files are saved at /tmp or PWD i.e. in the mounted volume directory.
Save vLLM sever and client log files¶
If vLLM server is launched using Docker Compose command, the log files are saved at /tmp by default.
If vLLM server is launched using Docker Run command, the user can save the log files by mounting /tmp as -v /tmp:/root/scripts/logs. If user has write access to NFS, mount -v ${PWD}:/root/scripts/logs instead of -v /tmp:/root/scripts/logs.
Create multiple vLLM services using Docker Compose¶
Set environment variables HOST_PORT and COMPOSE_PROJECT_NAME
Example
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
HOST_PORT=9000 \
COMPOSE_PROJECT_NAME=serv1 \
docker compose up
Note
The default values, when these vars not set, are HOST_PORT=8000 and COMPOSE_PROJECT_NAME=cd.
Pinning CPU Cores for Memory Access Coherence¶
To improve memory-access coherence and release CPUs to other CPU-only workloads, such as vLLM serving with Llama3 8B, you can pin CPU cores based on different CPU Non-Uniform Memory Access (NUMA) nodes using the automatically generated docker-compose.override.yml file. The following procedure explains the process.
The Xeon processors currently validated for this setup are: Intel Xeon 6960P and Intel Xeon PLATINUM 8568Y+.
-
Install the required python libraries.
-
Pin CPU cores using the
docker-compose.override.ymlfile.export MODEL="Qwen/Qwen2.5-14B-Instruct" export HF_TOKEN="<your huggingface token>" export DOCKER_IMAGE="<docker image url>" python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml docker compose --profile benchmark up -
Specify the service name in
docker-compose.override.ymlto bind idle CPUs to another service, such asvllm-cpu-service, as in the following example:export MODEL="Qwen/Qwen2.5-14B-Instruct" export HF_TOKEN="<your huggingface token>" export DOCKER_IMAGE="<docker image url>" python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml --cpuservice vllm-cpu-service docker compose --profile benchmark -f docker-compose.yml -f docker-compose.vllm-cpu-service.yml -f docker-compose.override.yml up