Skip to content

Advanced Configuration Options

To align the setup to your specific needs, you can use optional advanced configurations for running the vLLM server and benchmark. These configurations let you fine-tune performance, memory usage, and request handling using additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.

Running vLLM Using Docker Compose with Custom Parameters

This configuration allows you to override the default settings by providing additional environment variables when starting the server. This allows fine-tuning for performance and memory usage.

The following table lists the available variables:

Variable Description
PT_HPU_LAZY_MODE Enables Lazy execution mode, potentially improving performance by batching operations.
Lazy mode is not the default. Lazy mode is compatible only with Intel Gaudi PyTorch fork.
VLLM_SKIP_WARMUP Skips the model warm-up phase to reduce startup time. It may affect initial latency.
MAX_MODEL_LEN Sets the maximum supported sequence length for the model.
MAX_NUM_SEQS Specifies the maximum number of sequences processed concurrently.
TENSOR_PARALLEL_SIZE Defines the degree of tensor parallelism.
VLLM_BUCKETING_STRATEGY Selects the bucketing strategy used for warm-up: exp, lin, or pad.
VLLM_DECODE_BLOCK_BUCKET_STEP Configures the step size for decode block allocation, affecting memory granularity.
VLLM_DECODE_BS_BUCKET_STEP Sets the batch size step for decode operations, impacting how decode batches are grouped.
VLLM_PROMPT_BS_BUCKET_STEP Adjusts the batch size step for prompt processing.
VLLM_PROMPT_QUERY_BUCKET_STEP Controls the step size for prompt query-length allocation.
EXTRA_ARGS Additional vLLM serve args for the server bringup e.g., " --served-model-name model_name"

Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
docker compose up

Running vLLM and Benchmark with Custom Parameters

This configuration allows you to customize benchmark behavior by setting additional environment variables before running Docker Compose.

The following table lists the available variables:

Variable Description
INPUT_TOK Number of input tokens per prompt.
OUTPUT_TOK Number of output tokens to generate per prompt.
CON_REQ Number of concurrent requests during benchmarking.
NUM_PROMPTS Total number of prompts to use in the benchmark.
EXTRA_BENCH_ARGS Additional vLLM bench args e.g., " --tokenizer-mode hf"

Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up

This launches the vLLM server and runs the benchmark using your specified parameters.

Running vLLM and Benchmark with Combined Custom Parameters

This configuration allows you to launch the vLLM server and benchmark together. You can set any combination of server and benchmark-specific variables mentioned earlier. Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
INPUT_TOK=128 \
OUTPUT_TOK=128 \
CON_REQ=16 \
NUM_PROMPTS=64 \
docker compose --profile benchmark up

This command starts the server and executes benchmarking with the provided configuration.

Running vLLM and Benchmark Using Configuration Files

This configuration allows you to configure the server and benchmark via YAML configuration files.

The following table lists the available environment variables:

Variable Description
VLLM_SERVER_CONFIG_FILE Path to the server config file inside the Docker container.
VLLM_SERVER_CONFIG_NAME Name of the server config section.
VLLM_BENCHMARK_CONFIG_FILE Path to the benchmark config file inside the container.
VLLM_BENCHMARK_CONFIG_NAME Name of the benchmark config section.

Set the preferred variable when running the vLLM server using Docker Compose, as presented in the following example:

HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server/server_scenarios_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark/benchmark_scenarios_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up

Note

When using configuration files, you do not need to set the MODEL variable as the model details are included in the config files. However, the HF_TOKEN flag is still required.

Running vLLM Directly Using Docker

For maximum control, you can run the server directly using the docker run command, allowing full customization of Docker runtime settings, as in the following example:

docker run -it --rm \
    -e MODEL=$MODEL \
    -e HF_TOKEN=$HF_TOKEN \
    -e http_proxy=$http_proxy \
    -e https_proxy=$https_proxy \
    -e no_proxy=$no_proxy \
    --cap-add=sys_nice \
    --ipc=host \
    --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -p 8000:8000 \
    --name vllm-server \
    <docker image name>

This method provides full flexibility over how the vLLM server is executed within the container.

Dry Run to create vLLM sever and client command line

Set environment variable DRY_RUN=1
DRY_RUN env var set to 1 create a copy of vllm-server.sh or vllm-benchmark.sh command line file on the host machine, without launching the server or the client.

Example - Docker Compose

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
DRY_RUN=1 \
docker compose up

Example - Docker Run

docker run -it --rm \
    -e MODEL=$MODEL \
    -e HF_TOKEN=$HF_TOKEN \
    -e http_proxy=$http_proxy \
    -e https_proxy=$https_proxy \
    -e no_proxy=$no_proxy \
    --cap-add=sys_nice \
    --ipc=host \
    --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -p 8000:8000 \
    -e DRY_RUN=1 \
    -v /tmp:/local \
    --name vllm-server \
    <docker image name>

Note

While launching the vLLM server using Docker Run command for Dry Run, make sure to mount /tmp directory as -v /tmp:/local. If user has write access to NFS, mount -v ${PWD}:/local instead of -v /tmp:/local. The command line files are saved at /tmp or PWD i.e. in the mounted volume directory.

Save vLLM sever and client log files

If vLLM server is launched using Docker Compose command, the log files are saved at /tmp by default.

If vLLM server is launched using Docker Run command, the user can save the log files by mounting /tmp as -v /tmp:/root/scripts/logs. If user has write access to NFS, mount -v ${PWD}:/root/scripts/logs instead of -v /tmp:/root/scripts/logs.

Create multiple vLLM services using Docker Compose

Set environment variables HOST_PORT and COMPOSE_PROJECT_NAME
Example

MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/1.24.0/ubuntu24.04/habanalabs/vllm-0.21.0-ptupstream-2.10.0:latest" \
TENSOR_PARALLEL_SIZE=1 \
MAX_MODEL_LEN=2048 \
HOST_PORT=9000 \
COMPOSE_PROJECT_NAME=serv1 \
docker compose up

Note

The default values, when these vars not set, are HOST_PORT=8000 and COMPOSE_PROJECT_NAME=cd.

Pinning CPU Cores for Memory Access Coherence

To improve memory-access coherence and release CPUs to other CPU-only workloads, such as vLLM serving with Llama3 8B, you can pin CPU cores based on different CPU Non-Uniform Memory Access (NUMA) nodes using the automatically generated docker-compose.override.yml file. The following procedure explains the process.

The Xeon processors currently validated for this setup are: Intel Xeon 6960P and Intel Xeon PLATINUM 8568Y+.

  1. Install the required python libraries.

    pip install -r vllm-gaudi/.cd/server/cpu_binding/requirements_cpu_binding.txt
    
  2. Pin CPU cores using the docker-compose.override.yml file.

    export MODEL="Qwen/Qwen2.5-14B-Instruct"
    export HF_TOKEN="<your huggingface token>"
    export DOCKER_IMAGE="<docker image url>"
    python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml
    docker compose --profile benchmark up
    
  3. Specify the service name in docker-compose.override.yml to bind idle CPUs to another service, such as vllm-cpu-service, as in the following example:

    export MODEL="Qwen/Qwen2.5-14B-Instruct"
    export HF_TOKEN="<your huggingface token>"
    export DOCKER_IMAGE="<docker image url>"
    python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml --cpuservice vllm-cpu-service
    docker compose --profile benchmark -f docker-compose.yml -f docker-compose.vllm-cpu-service.yml -f docker-compose.override.yml up