Environment Variables

This document lists the supported diagnostic, profiling, and performance tuning options.

Diagnostic and Profiling Parameters

Parameter name	Description	Default value
`VLLM_PROFILER_ENABLED`	Enables the high-level profiler. You can view resulting JSON traces at perfetto.habana.ai.	`false`
`VLLM_HPU_LOG_STEP_GRAPH_COMPILATION`	Logs graph compilations for each vLLM engine step, only when a compilation occurs. We recommend using it in conjunction with `PT_HPU_METRICS_GC_DETAILS=1`.	`false`
`VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL`	Logs graph compilations for every vLLM engine step, even if no compilation occurs.	`false`
`VLLM_HPU_LOG_STEP_CPU_FALLBACKS`	Logs CPU fallbacks for each vLLM engine step, only when a fallback occurs.	`false`
`VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`	Logs CPU fallbacks for each vLLM engine step, even if no fallback occurs.	`false`
`VLLM_T_COMPILE_FULLGRAPH`	Forces the PyTorch compile function to raise an error if any graph breaks happen during compilation. This allows for the easy detection of existing graph breaks, which usually reduce performance.	`false`
`VLLM_T_COMPILE_DYNAMIC_SHAPES`	Forces PyTorch to compile graphs with dynamic options disabled, so that dynamic shapes are used only when needed.	`false`
`VLLM_FULL_WARMUP`	Forces PyTorch to assume that the warm-up phase fully covers all possible tensor sizes, preventing further compilation. If compilation occurs after warm-up, PyTorch crashes with the message `Recompilation triggered with skip_guard_eval_unsafe stance. This usually means that you have not warmed up your model with enough inputs such that you can guarantee no more recompilations.` In that case, this option must be disabled.	`false`
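As a minimal sketch, the step-level logging flags above could be collected into an environment mapping before the server is launched from Python. The helper name `diagnostic_env` is hypothetical, and using the string `"true"` to enable a flag is an assumption inferred from the `false` defaults in the table; `PT_HPU_METRICS_GC_DETAILS=1` is the pairing recommended above.

```python
import os


def diagnostic_env(base=None):
    """Return a copy of `base` (or os.environ) with step-level logging enabled.

    Hypothetical helper: variable names come from the table above, and the
    "true" value format is an assumption.
    """
    env = dict(os.environ if base is None else base)
    env.update(
        {
            # Log graph compilations only on steps where one occurs.
            "VLLM_HPU_LOG_STEP_GRAPH_COMPILATION": "true",
            # Companion setting recommended alongside the flag above.
            "PT_HPU_METRICS_GC_DETAILS": "1",
            # Log CPU fallbacks only on steps where one occurs.
            "VLLM_HPU_LOG_STEP_CPU_FALLBACKS": "true",
        }
    )
    return env


if __name__ == "__main__":
    for name, value in sorted(diagnostic_env({}).items()):
        print(f"{name}={value}")
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the server process.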

Performance Tuning Parameters

Parameter name	Description	Default value
`VLLM_GRAPH_RESERVED_MEM`	Percentage of memory dedicated to HPUGraph capture.	`0.1`
`VLLM_BUCKETING_STRATEGY`	Selects the bucketing strategy: `exp`, `lin`, or `pad`.	`exp`
`VLLM_EXPONENTIAL_BUCKETING`	Deprecated compatibility flag. If set, it overrides `VLLM_BUCKETING_STRATEGY`: `true` forces `exp`, `false` forces `lin`. It cannot select `pad` and will be removed in a future release.	`None`
`VLLM_BUCKETING_FROM_FILE`	Enables reading bucket configuration from file.	`None`
`VLLM_ROW_PARALLEL_CHUNKS`	Number of chunks to split input into for pipelining matmul with all-reduce in RowParallelLinear layers. Setting to a value greater than 1 enables chunking. See Row-Parallel Chunking.	`1` (disabled)
`VLLM_ROW_PARALLEL_CHUNK_THRESHOLD`	Minimum number of tokens required to activate row-parallel chunking. Inputs below this threshold use the standard non-chunked path.	`8192`
`VLLM_PROMPT_BS_BUCKET_MAX`	Sets the prefill batch size.	`1`
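The interaction between the two row-parallel chunking variables in the table above can be sketched as follows. This is an illustration of the documented rule only (chunking requires more than one chunk and an input at or above the threshold); the function name `uses_row_parallel_chunking` is hypothetical and the integer parsing of the values is an assumption, not the plugin's actual code.

```python
import os


def uses_row_parallel_chunking(num_tokens, env=None):
    """Return True if an input of `num_tokens` would take the chunked path."""
    env = os.environ if env is None else env
    # Defaults taken from the table above: 1 chunk (disabled), 8192 tokens.
    chunks = int(env.get("VLLM_ROW_PARALLEL_CHUNKS", "1"))
    threshold = int(env.get("VLLM_ROW_PARALLEL_CHUNK_THRESHOLD", "8192"))
    # Chunking needs more than one chunk and an input at or above the threshold.
    return chunks > 1 and num_tokens >= threshold


if __name__ == "__main__":
    settings = {"VLLM_ROW_PARALLEL_CHUNKS": "4"}
    for tokens in (4096, 8192, 16384):
        print(tokens, uses_row_parallel_chunking(tokens, settings))
```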

Use VLLM_BUCKETING_STRATEGY=exp for the default exponential warm-up, VLLM_BUCKETING_STRATEGY=lin for explicitly configured linear ranges, or VLLM_BUCKETING_STRATEGY=pad for padding-aware ranges with absolute and relative padding limits.

Leave VLLM_EXPONENTIAL_BUCKETING unset when using VLLM_BUCKETING_STRATEGY. The legacy flag is checked for backward compatibility and still overrides the selected strategy when present.
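The precedence described above can be sketched in a few lines of Python. This is not the plugin's implementation: the function name `resolve_bucketing_strategy` is hypothetical, and the set of strings treated as true for the legacy flag is an assumption.

```python
VALID_STRATEGIES = ("exp", "lin", "pad")


def resolve_bucketing_strategy(env):
    """Return the effective bucketing strategy for an environment mapping."""
    legacy = env.get("VLLM_EXPONENTIAL_BUCKETING")
    if legacy is not None:
        # The deprecated flag wins whenever it is present; it cannot select pad.
        # Assumption: "1" and "true" (any case) count as true.
        return "exp" if legacy.strip().lower() in ("1", "true") else "lin"
    strategy = env.get("VLLM_BUCKETING_STRATEGY", "exp")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown bucketing strategy: {strategy!r}")
    return strategy


if __name__ == "__main__":
    print(resolve_bucketing_strategy({"VLLM_BUCKETING_STRATEGY": "pad"}))
    print(
        resolve_bucketing_strategy(
            {
                "VLLM_BUCKETING_STRATEGY": "pad",
                "VLLM_EXPONENTIAL_BUCKETING": "false",
            }
        )
    )
```

The second call shows why the legacy flag should stay unset: it silently replaces the requested `pad` strategy with `lin`.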

Developer Mode Parameters

To enter developer mode, use VLLM_DEVELOPER_MODE:

Parameter name	Description	Default value
`VLLM_SKIP_WARMUP`	Skips the warm-up phase.	`false`

Additional Parameters

Parameter name	Description	Default value
`VLLM_HANDLE_TOPK_DUPLICATES`	Handles duplicates outside top-k.	`false`
`VLLM_CONFIG_HIDDEN_LAYERS`	Sets the number of hidden layers to run per HPUGraph for model splitting among hidden layers when TP is 1. It improves throughput by reducing inter-token latency limitations in some models.	`1`
`VLLM_WORKER_MULTIPROC_METHOD`	Sets the Python `multiprocessing` start method used by the `mp` distributed executor backend when launching worker processes. The upstream default is `fork`. On HPU, it is automatically overridden to `spawn` with a warning because forked child processes inherit HPU driver state and can hang on exit. The override is applied when `--distributed-executor-backend` is `mp` or `uni`. With `uni`, no subprocess is created, so the value has no practical effect. With `external_launcher` and `ray`, workers are not started through Python `multiprocessing`, so the value is irrelevant. Set `VLLM_WORKER_MULTIPROC_METHOD=spawn` explicitly to suppress the auto-override warning, or set it to `fork` to opt out of the override, which is not recommended.	`spawn` on HPU (auto-overridden from upstream `fork`)
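The override rule for `VLLM_WORKER_MULTIPROC_METHOD` can be summarized with a small sketch. The function name `effective_start_method` and the `on_hpu` parameter are hypothetical, and the executor-backend conditions described in the table are left out for brevity; the sketch only models "unset means auto-override with a warning, explicit values are honored".

```python
def effective_start_method(env, on_hpu=True):
    """Return (start_method, warning_emitted) for an environment mapping."""
    value = env.get("VLLM_WORKER_MULTIPROC_METHOD")
    if value is not None:
        # An explicit value is honored: "spawn" suppresses the warning,
        # "fork" opts out of the override (not recommended on HPU).
        return value, False
    if on_hpu:
        # Unset on HPU: the upstream default is overridden with a warning.
        return "spawn", True
    # Unset elsewhere: the upstream default applies.
    return "fork", False


if __name__ == "__main__":
    print(effective_start_method({}))
    print(effective_start_method({"VLLM_WORKER_MULTIPROC_METHOD": "spawn"}))
```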

Heterogeneous KV Transfer (NIXL)

These variables control the NIXL KV-cache transfer path when an HPU prefill instance serves a GPU decode instance (disaggregated prefill/decode across different accelerators). They apply on the HPU prefill side and require VLLM_HPU_HETERO_KV_LAYOUT=true plus enable_permute_local_kv in the KV transfer config.

Parameter name	Description	Default value
`VLLM_HPU_NIXL_JOINT_KV`	Enables joint-KV staging so an HPU prefill instance can serve a GPU (FLASH_ATTN, blocks-first) decode instance. The GPU registers one joint `[K\\|V]` region per layer and reads V at `block_len // 2`; when enabled, the HPU stages transferred blocks into a matching joint buffer and advertises the same region model. Leave `false` for HPU-to-HPU disaggregation (separate K/V regions).	`false`
`VLLM_HPU_NIXL_STAGING_SLOTS`	Number of joint-KV staging slots (each holds one on-save block for all layers). `0` auto-sizes from `min(max_num_seqs * ceil(max_model_len / block_size_on_save), fraction_of_total_HPU_memory / per_slot_bytes)`. Set a positive value to override; must be identical on prefill scheduler and worker (same process, so a single value applies).	`0` (auto)
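To make the auto-sizing formula for `VLLM_HPU_NIXL_STAGING_SLOTS` concrete, here is a worked sketch. The function name and the example numbers are hypothetical; treating `fraction_of_total_HPU_memory` as a byte budget and using floor division for the memory bound are assumptions.

```python
import math


def auto_staging_slots(
    max_num_seqs,
    max_model_len,
    block_size_on_save,
    memory_budget_bytes,
    per_slot_bytes,
):
    """Evaluate min(workload bound, memory bound) from the table above."""
    # Slots needed if every sequence reaches the maximum model length.
    by_workload = max_num_seqs * math.ceil(max_model_len / block_size_on_save)
    # Slots that fit into the memory budget (assumption: rounded down).
    by_memory = memory_budget_bytes // per_slot_bytes
    return min(by_workload, by_memory)


if __name__ == "__main__":
    # Hypothetical numbers: 8 sequences, 4096-token context, 128-token blocks,
    # a 1 GiB budget, and 8 MiB per slot.
    print(auto_staging_slots(8, 4096, 128, 1 << 30, 8 << 20))
```

With these example numbers the workload bound is 256 slots and the memory bound is 128 slots, so the memory bound decides.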

Distributed Executor Backend on HPU

vLLM exposes the --distributed-executor-backend CLI flag, also available as distributed_executor_backend in the Python API. On HPU, the relevant choices are:

mp: Python multiprocessing-based executor. It is used when world_size > 1 (that is, TP * PP * DP > 1), and each worker runs in its own subprocess. This backend honors VLLM_WORKER_MULTIPROC_METHOD. On HPU, the start method is forced to spawn to avoid teardown hangs caused by forking after HPU driver initialization. mp is the recommended backend for single-node, multi-card serving on Gaudi.
uni: In-process (uni-process) executor. It is selected automatically when world_size == 1 (typically TP=1, PP=1, DP=1), so no subprocess is started and the worker runs inside the engine process. VLLM_WORKER_MULTIPROC_METHOD has no effect on uni worker creation. However, the HPU platform still sets the environment variable so that engine-adjacent multiprocessing, such as LMCache helpers or plugins, also runs under spawn.
external_launcher: vLLM does not start any workers. Instead, the user is expected to launch all processes through an external tool such as torchrun, MPI, or SLURM. This option is available on HPU, but it is not commonly used.
ray: Ray-based executor. Workers run as Ray actors rather than through Python multiprocessing. Multi-node serving with Ray on Gaudi has not yet been validated by the Gaudi software product engineering team. Use mp for production deployments.

If the flag is not provided, vLLM selects the backend automatically: uni when world_size == 1, and mp otherwise on HPU.
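The automatic selection rule can be written as a one-line function. The name `default_backend` is hypothetical; the rule itself (world size is TP * PP * DP, `uni` for 1, `mp` otherwise) is taken from the description above.

```python
def default_backend(tp=1, pp=1, dp=1):
    """Return the backend chosen on HPU when the flag is not provided."""
    world_size = tp * pp * dp
    return "uni" if world_size == 1 else "mp"


if __name__ == "__main__":
    print(default_backend())      # single card
    print(default_backend(tp=8))  # single node, eight cards
```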

HPU PyTorch bridge environment variables impacting vLLM execution:

Parameter name	Description	Default value
`PT_HPU_LAZY_MODE`	Sets the backend for Gaudi, with `0` for PyTorch Eager and `1` for PyTorch Lazy.	`0`
`PT_HPU_ENABLE_LAZY_COLLECTIVES`	Must be set to `true` for tensor parallel inference with HPU Graphs.	`true`
`PT_HPUGRAPH_DISABLE_TENSOR_CACHE`	Must be set to `false` for LLaVA, Qwen, and RoBERTa models.	`false`
`VLLM_PROMPT_USE_FLEX_ATTENTION`	Enabled only for the Llama model, allowing usage of `torch.nn.attention.flex_attention` instead of FusedSDPA. Requires `VLLM_PROMPT_USE_FUSEDSDPA=0`.	`false`
`RUNTIME_SCALE_PATCHING`	Enables the runtime scale patching feature, which applies only to FP8 execution and is ignored for BF16.	`true` (Torch Compile mode), `false` (Lazy mode)
`ENABLE_EXPERIMENTAL_FLAGS` and `ENABLE_SKIP_REMOVAL_OF_GRAPH_INPUT_IDENTITY_NODES`	Must both be set to `true` for Qwen3.5 (GDN hybrid) models to improve graph compilation performance.	`false`

Additional Performance Tuning Parameters for Bucketing Strategies

VLLM_{phase}_{dim}_BUCKET_{param} is a collection of environment variables configuring user-defined bucket ranges, where:

{phase} is in ['PROMPT', 'DECODE'].
{dim} is in ['BS', 'QUERY', 'CTX'] for PROMPT phase or in ['BS', 'BLOCK'] for DECODE phase.
{param} is in ['MIN', 'STEP', 'MAX'] for the lin strategy.
{param} is in ['MIN', 'STEP', 'MAX', 'PAD_MAX', 'PAD_PERCENT'] for the pad strategy.
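The naming scheme can be expanded mechanically. The sketch below enumerates every variable name for a given strategy; the function name `bucket_variable_names` is hypothetical, while the phases, dimensions, and parameters are the ones listed above.

```python
DIMS = {
    "PROMPT": ("BS", "QUERY", "CTX"),
    "DECODE": ("BS", "BLOCK"),
}
PARAMS = {
    "lin": ("MIN", "STEP", "MAX"),
    "pad": ("MIN", "STEP", "MAX", "PAD_MAX", "PAD_PERCENT"),
}


def bucket_variable_names(strategy):
    """List all VLLM_{phase}_{dim}_BUCKET_{param} names for a strategy."""
    return [
        f"VLLM_{phase}_{dim}_BUCKET_{param}"
        for phase, dims in DIMS.items()
        for dim in dims
        for param in PARAMS[strategy]
    ]


if __name__ == "__main__":
    for name in bucket_variable_names("pad"):
        print(name)
```

This yields 15 names for the lin strategy and 25 for the pad strategy.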

The following table lists the available variables with their default values. PAD_MAX and PAD_PERCENT are used only when VLLM_BUCKETING_STRATEGY=pad.

Phase	Variable name	Default value
Prompt	batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`)	`1`
Prompt	batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`)	`1`
Prompt	batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`)	`max_num_prefill_seqs`
Prompt	batch size max abs padding (`VLLM_PROMPT_BS_BUCKET_PAD_MAX`)	`ceil(max_num_prefill_seqs / 4)`
Prompt	batch size max padding percent (`VLLM_PROMPT_BS_BUCKET_PAD_PERCENT`)	`25`
Prompt	query length min (`VLLM_PROMPT_QUERY_BUCKET_MIN`)	`block_size`
Prompt	query length step (`VLLM_PROMPT_QUERY_BUCKET_STEP`)	`block_size`
Prompt	query length max (`VLLM_PROMPT_QUERY_BUCKET_MAX`)	`max_num_batched_tokens`
Prompt	query length max abs padding (`VLLM_PROMPT_QUERY_BUCKET_PAD_MAX`)	`ceil(max_num_batched_tokens / 4)`
Prompt	query length max padding percent (`VLLM_PROMPT_QUERY_BUCKET_PAD_PERCENT`)	`25`
Prompt	sequence ctx min (`VLLM_PROMPT_CTX_BUCKET_MIN`)	`0`
Prompt	sequence ctx step (`VLLM_PROMPT_CTX_BUCKET_STEP`)	`2`
Prompt	sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`)	`ceil((max_model_len - VLLM_PROMPT_QUERY_BUCKET_MIN) / block_size)`
Prompt	sequence ctx max abs padding (`VLLM_PROMPT_CTX_BUCKET_PAD_MAX`)	`ceil(max_num_batched_tokens / block_size)`
Prompt	sequence ctx max padding percent (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`)	`25`
Decode	batch size min (`VLLM_DECODE_BS_BUCKET_MIN`)	`1`
Decode	batch size step (`VLLM_DECODE_BS_BUCKET_STEP`)	`2`
Decode	batch size max (`VLLM_DECODE_BS_BUCKET_MAX`)	`max_num_seqs`
Decode	batch size max abs padding (`VLLM_DECODE_BS_BUCKET_PAD_MAX`)	`ceil(max_num_seqs / 4)`
Decode	batch size max padding percent (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`)	`25`
Decode	num blocks min (`VLLM_DECODE_BLOCK_BUCKET_MIN`)	`block_size`
Decode	num blocks step (`VLLM_DECODE_BLOCK_BUCKET_STEP`)	`block_size`
Decode	num blocks max (`VLLM_DECODE_BLOCK_BUCKET_MAX`)	`ceil(max_model_len * max_num_seqs / block_size)` by default or `max_blocks` if `VLLM_CONTIGUOUS_PA = True`
Decode	num blocks max abs padding (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`)	`ceil(VLLM_DECODE_BLOCK_BUCKET_MAX / 4)`
Decode	num blocks max padding percent (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`)	`25`
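Several defaults in the table are derived from engine settings. The following sketch evaluates a subset of those formulas for one hypothetical configuration so the resulting numbers are easier to see. The function name and the example values are assumptions, and the `VLLM_CONTIGUOUS_PA` case (`max_blocks`) is not modeled.

```python
import math


def derived_bucket_defaults(
    max_model_len, max_num_seqs, max_num_batched_tokens, block_size
):
    """Evaluate some default formulas from the table above."""
    query_min = block_size  # VLLM_PROMPT_QUERY_BUCKET_MIN default
    block_max = math.ceil(max_model_len * max_num_seqs / block_size)
    return {
        "VLLM_PROMPT_QUERY_BUCKET_MAX": max_num_batched_tokens,
        "VLLM_PROMPT_QUERY_BUCKET_PAD_MAX": math.ceil(max_num_batched_tokens / 4),
        "VLLM_PROMPT_CTX_BUCKET_MAX": math.ceil(
            (max_model_len - query_min) / block_size
        ),
        "VLLM_PROMPT_CTX_BUCKET_PAD_MAX": math.ceil(
            max_num_batched_tokens / block_size
        ),
        "VLLM_DECODE_BS_BUCKET_MAX": max_num_seqs,
        "VLLM_DECODE_BS_BUCKET_PAD_MAX": math.ceil(max_num_seqs / 4),
        "VLLM_DECODE_BLOCK_BUCKET_MAX": block_max,
        "VLLM_DECODE_BLOCK_BUCKET_PAD_MAX": math.ceil(block_max / 4),
    }


if __name__ == "__main__":
    # Hypothetical engine settings, chosen only for illustration.
    defaults = derived_bucket_defaults(
        max_model_len=3072,
        max_num_seqs=32,
        max_num_batched_tokens=2048,
        block_size=128,
    )
    for name, value in defaults.items():
        print(f"{name}={value}")
```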

VLLM_PROMPT_BS_BUCKET_MAX no longer affects only prompt warm-up coverage. It also affects the real prefill batch size used by the Gaudi runner.

The default value of 25 for VLLM_*_BUCKET_PAD_PERCENT balances warm-up duration against runtime performance. A smaller value, such as 10, introduces more buckets and reduces padding, which improves runtime performance. Setting it to 0 falls back to the original linear bucketing with minimum padding. Setting it to 50 is close to exponential bucketing, except when the corresponding VLLM_*_BUCKET_MIN is neither 0 nor 1.

Legacy VLLM_PROMPT_SEQ_BUCKET_* variables are still accepted as a fallback for prompt query settings when VLLM_PROMPT_QUERY_BUCKET_* is not set, but this compatibility path is deprecated and will be removed in a future release.

When a deployed workload does not use the full context that a model can handle, we recommend limiting the maximum values upfront, based on the input and output token lengths you expect the vLLM server to handle. For example, suppose you want to deploy the text generation model Qwen2.5-1.5B with max_position_embeddings of 131072 (our max_model_len), and your workload will not use the full context length: you expect a maximum input size of 1K tokens and a maximum output of 2K tokens. In this case, preparing the vLLM server for the full context length is unnecessary, and limiting the values upfront reduces startup and warm-up time. Recommended settings for this case are:

--max_model_len: 3072, which is the sum of input and output sequences (1+2)*1024.
VLLM_PROMPT_QUERY_BUCKET_MAX: 1024, which is the maximum input token size that you expect to handle.

Note

If the model config specifies a high max_model_len, set it to the sum of input_tokens and output_tokens, rounded up to a multiple of block_size according to actual requirements.
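The rounding described in the note can be sketched as a small calculation. The function name is hypothetical, and the default `block_size` of 128 is an assumption; substitute the block size that your deployment actually uses.

```python
import math


def recommended_max_model_len(input_tokens, output_tokens, block_size=128):
    """Sum the expected lengths and round up to a multiple of block_size."""
    total = input_tokens + output_tokens
    return math.ceil(total / block_size) * block_size


if __name__ == "__main__":
    # The 1K-input, 2K-output workload from the example above.
    print(recommended_max_model_len(1024, 2048))
```

For the workload in the example, 1024 + 2048 is already a multiple of 128, so the result matches the recommended value of 3072.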

Additional Performance Tuning Parameters for the FusedSDPA Kernel with Padding-Aware Bucketing

When the padding-aware bucketing strategy is used, FusedSDPA can be split into smaller chunks to improve performance. This strategy guarantees a maximum absolute padding in the sequence and context dimensions.

Parameter name	Description	Default value
`VLLM_HPU_FSDPA_SLICE_ENABLED`	Enables slicing.	`True` when using the padding-aware bucketing strategy with bucketing enabled, merged prefill disabled, and the FusedSDPA kernel available
`VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD`	KV length threshold above which slicing is applied.	`min(max_num_batched_tokens, 8192)`
`VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE`	Chunk size for `q_len` and `kv_len` in each chunk. Rounded up to the next multiple of 1024.	`VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD // 2`
`VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS`	Places each chunk in a separate graph to reduce compilation time.	`true` for lazy mode and `false` otherwise
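The default threshold and chunk size in the table above depend on `max_num_batched_tokens`. The sketch below evaluates them for two example values. The function name is hypothetical; applying the 1024 rounding to the default chunk size, and leaving a value that is already a multiple of 1024 unchanged, are assumptions about how the rule is applied.

```python
def fsdpa_slice_defaults(max_num_batched_tokens):
    """Return (seq_len_threshold, chunk_size) following the table above."""
    threshold = min(max_num_batched_tokens, 8192)
    chunk = threshold // 2
    # Round up to a multiple of 1024 (ceiling division on integers).
    chunk = -(-chunk // 1024) * 1024
    return threshold, chunk


if __name__ == "__main__":
    for tokens in (3000, 32768):
        print(tokens, fsdpa_slice_defaults(tokens))
```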

Note

These parameters are effective only with the padding-aware bucketing strategy set by VLLM_BUCKETING_STRATEGY="pad".

Slicing is activated only if all of the following additional conditions are satisfied:

- The batch size is 1.
- The query length and KV length are different; that is, a normal causal prefill routes to the default dispatch for better performance.
- The model uses causal attention.
- The padding side is 'right'.
- No sliding window or sinks are used (BF16 only; FP8 does not support sinks).
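The additional conditions can be expressed as a single predicate. This is a sketch only: the `PrefillCall` class, its field names, and the function `slicing_conditions_met` are hypothetical, and the KV length threshold and the BF16/FP8 distinction are not modeled.

```python
from dataclasses import dataclass


@dataclass
class PrefillCall:
    """Hypothetical description of one prefill attention call."""

    batch_size: int
    q_len: int
    kv_len: int
    is_causal: bool
    padding_side: str
    has_sliding_window: bool = False
    has_sinks: bool = False


def slicing_conditions_met(call):
    """Check the additional activation conditions for FusedSDPA slicing."""
    return (
        call.batch_size == 1
        # Equal lengths mean a normal causal prefill, which is not sliced.
        and call.q_len != call.kv_len
        and call.is_causal
        and call.padding_side == "right"
        and not call.has_sliding_window
        and not call.has_sinks
    )


if __name__ == "__main__":
    call = PrefillCall(
        batch_size=1, q_len=2048, kv_len=10240, is_causal=True, padding_side="right"
    )
    print(slicing_conditions_met(call))
```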