Long Context Configuration¶

Long context feature enables support for a token context window exceeding 128K tokens. It is supported by the following models:

Environment Variables Settings¶

Set the following environment variables to avoid OOM/functional issues. Additional environment variable settings depend on context length:

VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
VLLM_RPC_TIMEOUT=100000
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Warm-up Buckets Preparation¶

With VLLM_BUCKETING_STRATEGY=exp, long-context buckets are prepared automatically. The lin and pad strategies usually require manual bucket tuning for long-context workloads. The following table presents 32K context-length examples for manual warm-up tuning:

Flag	Suggested value	Notes
`VLLM_GRAPH_RESERVED_MEM`	`0.02` or `0.1`	It depends on the model and context length settings. Set to `0.02` for Llama3.1-8B or `0.1` for Llama3.1-70B.
`VLLM_PROMPT_QUERY_BUCKET_MIN`	`24576`	The value depends on the warm-up results.
`VLLM_PROMPT_QUERY_BUCKET_STEP`	`2048`	The value depends on the warm-up results. We recommend increasing it to a higher value for faster warm-up. For Intel Gaudi 3, we suggest setting it to `16384`.
`VLLM_PROMPT_QUERY_BUCKET_MAX`	`32768`	The value for context length is 32K; use 16384 for 16K.
`VLLM_DECODE_BLOCK_BUCKET_MIN`	`1024`	The value depends on the warm-up results.
`VLLM_DECODE_BLOCK_BUCKET_STEP`	`1024`	The value depends on the warm-up results.
`VLLM_DECODE_BLOCK_BUCKET_MAX`	`33792`	Calculate the value of `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example: `128 ((32 + 1) 1024) / 128` and `32 ((32 + 1) 1024) / 128`.

Legacy VLLM_PROMPT_SEQ_BUCKET_* aliases are still accepted for prompt query tuning, but VLLM_PROMPT_QUERY_BUCKET_* is the preferred naming.

Batch Size Settings¶

The default batch_size=256 setting is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups. If recompilation or next recomputation warnings appear during inference, reduce batch_size to improve stability.

An example of a recompilation message:

Configuration: (prompt, 1, 36864) was not warmed-up!

An example of a warning message:

Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.