Supported Features

This document summarizes the features currently supported by the vLLM Hardware Plugin for Intel® Gaudi®, lists the features planned for future releases, and outlines the discontinued features with explanations for their deprecation.

Supported Features

| Feature | Description | References |
| --- | --- | --- |
| Offline batched inference | Supports offline inference using the `LLM` class from the vLLM Python API. | Quickstart, Example |
| Online inference via the OpenAI-Compatible Server | Supports online inference through an HTTP server that implements the OpenAI Chat and Completions APIs. | Documentation, Example |
| HPU autodetection | Enables automatic target platform detection for HPU users at vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel® Gaudi® accelerators | Provides custom paged attention and cache operator implementations optimized for Intel® Gaudi® devices. | N/A |
| Custom Intel® Gaudi® operator implementations | Provides optimized implementations of operators, such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
| Tensor parallel inference | Supports multi-HPU inference with tensor parallelism and multiprocessing. | Documentation, HCCL reference |
| Inference with HPU Graphs | Reduces host overheads by using HPU Graphs, which record execution graphs ahead of time and replay them during inference. | Documentation |
| Inference with `torch.compile` | Supports inference with `torch.compile`, which is the default setting for HPU. | vLLM HPU backend execution modes |
| INC quantization | Supports FP8 model and KV cache quantization, as well as calibration, with Intel Neural Compressor (INC). This feature is not fully supported with the `torch.compile` execution mode. | Documentation |
| AutoAWQ quantization | Supports inference with models quantized using the AutoAWQ library. | Library |
| AutoGPTQ quantization | Supports inference with models quantized using the AutoGPTQ library. | Library |
| LoRA/MultiLoRA support | Supports LoRA and MultiLoRA on compatible models. | vLLM supported models, Example |
| Fully async model executor | Allows the model runner to run asynchronously with async scheduling, overlapping CPU operations, including `prepare_inputs`, with the model forward pass. It does not support speculative decoding, pipeline parallelism (PP), or guided decoding. Expected speedup is 5-10% over the current async scheduling. | Feature description |
| Automatic Prefix Caching (APC) | Improves prefill efficiency by reusing the cached KV blocks of shared prompt prefixes. This feature is enabled by default. | Documentation |
| Speculative decoding (functional release) | Supports experimental speculative decoding, which improves inter-token latency in some scenarios. The feature is configurable via the standard `--speculative_model` and `--num_speculative_tokens` parameters. It is not fully supported with the `torch.compile` execution mode. | Documentation, Example |
| Multiprocessing backend | The default distributed runtime in vLLM. | Documentation |
| Multimodal | Supports inference for multimodal models. It is not fully supported with the `torch.compile` execution mode. | Documentation |
| Guided decode | Supports a guided decoding backend for generating structured outputs. | Documentation |
| Configurable bucketing strategies | Supports exponential (`exp`), linear (`lin`), and padding-aware (`pad`) bucketing strategies, selected with `VLLM_BUCKETING_STRATEGY`. The default is `exp`. | Bucketing Mechanism |
| Data Parallel support | Replicates model weights across multiple instances or HPUs to process independent request batches. | Documentation, Example |
| Row-Parallel Chunking | Overlaps computation with communication in `RowParallelLinear` layers by splitting the input into chunks and launching asynchronous all-reduce operations. Improves throughput for tensor-parallel inference with long prefills. Configured via the `VLLM_ROW_PARALLEL_CHUNKS` and `VLLM_ROW_PARALLEL_CHUNK_THRESHOLD` environment variables. | Documentation |
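
To make the difference between linear and exponential bucketing more concrete, the sketch below generates both kinds of bucket boundaries and rounds a sequence length up to the nearest bucket. It is a generic illustration of the idea only: the function names, the `min_size`/`step`/`max_size` parameters, and the doubling rule are assumptions made for this example and do not reproduce the plugin's actual bucket generation.

```python
from bisect import bisect_left


def linear_buckets(min_size: int, step: int, max_size: int) -> list[int]:
    """Evenly spaced boundaries: min_size, min_size + step, ..., max_size."""
    if step <= 0 or not 0 < min_size <= max_size:
        raise ValueError("expected step > 0 and 0 < min_size <= max_size")
    buckets = list(range(min_size, max_size + 1, step))
    if buckets[-1] != max_size:
        buckets.append(max_size)  # always cover the largest size
    return buckets


def exponential_buckets(min_size: int, max_size: int) -> list[int]:
    """Boundaries that double each time, so large sizes get fewer buckets.

    Hypothetical doubling rule, chosen only for illustration.
    """
    if not 0 < min_size <= max_size:
        raise ValueError("expected 0 < min_size <= max_size")
    buckets = []
    size = min_size
    while size < max_size:
        buckets.append(size)
        size *= 2
    buckets.append(max_size)
    return buckets


def pick_bucket(length: int, buckets: list[int]) -> int:
    """Return the smallest bucket that can hold `length` tokens."""
    index = bisect_left(buckets, length)
    if index == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket")
    return buckets[index]
```

In this toy setup, a 300-token request is padded to 384 with linear buckets of step 128 and to 512 with doubling buckets. The exponential variant needs fewer buckets overall, which is the general trade-off between warm-up cost and padding that a bucketing strategy controls.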

Experimental Features

Runtime Scale Patching

Warm-up time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

You can reduce the FP8 warm-up time by setting the `RUNTIME_SCALE_PATCHING=1` environment variable and selecting a hardware-aligned per-tensor `scale_method` from the options provided by the INC JSON config. This feature is recommended for larger models, such as 70B and 405B. When combined with `VLLM_BUCKETING_STRATEGY=exp` for FP8 models, it can reduce warm-up time by up to 90%.

Note

This feature reduces FP8 warm-up time but may lower model throughput by 5-20%. Future releases will improve performance and extend support to more options. Currently, the feature is supported with Lazy mode (`PT_HPU_LAZY_MODE=1`) and `torch.compile`. It supports Llama workloads using FP8 execution of Linear and FSDPA layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.
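
As a minimal sketch of how these settings might be combined before launching vLLM from Python, the helper below builds an environment dictionary with the variables named in this section. The helper name `fp8_warmup_env` and its parameters are hypothetical; only the environment variable names and values come from the text above. The per-tensor `scale_method` still has to be set separately in the INC JSON config.

```python
import os
from typing import Dict, Optional


def fp8_warmup_env(
    base: Optional[Dict[str, str]] = None, lazy_mode: bool = False
) -> Dict[str, str]:
    """Return a copy of `base` (or of os.environ) with the warm-up settings applied.

    Hypothetical helper for illustration; it does not touch the INC JSON config.
    """
    env = dict(os.environ if base is None else base)
    env["RUNTIME_SCALE_PATCHING"] = "1"      # enable runtime scale patching
    env["VLLM_BUCKETING_STRATEGY"] = "exp"   # exponential bucketing
    if lazy_mode:
        env["PT_HPU_LAZY_MODE"] = "1"        # opt in to Lazy mode
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the server, which keeps the settings scoped to that one process instead of the whole shell session.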

Trivial Scales Optimization

The `PT_HPU_H2D_TRIVIAL_SCALES_MODE` flag controls the optimization of trivial scales, such as scale values equal to 1.0, in the `RUNTIME_SCALE_PATCHING` mode. Enabling this optimization can increase warm-up and compilation time because additional graphs are generated, but it may improve runtime performance by reducing the number of multiplication operations.

The following values are supported:

`0`: No optimization (default).
`1`: Removes scales equal to 1.0 in `cast_to_fp8_v2` and `cast_from_fp8`, disabling the corresponding `mult_fwd` (multiplication) node.
`2`: Applies the same optimization as mode 1, and additionally removes reciprocal scales in `fp8_gemm_v2`.
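
The toy example below illustrates why removing a trivial scale is safe: multiplying by 1.0 leaves the values unchanged, so skipping the multiplication saves work without altering the result. It is a pure-Python model written for this page, not the HPU kernels; `scaled_cast`, `trivial_scales_env`, and the mode descriptions are hypothetical names and paraphrases, and only the flag name and the values 0, 1, and 2 come from the list above.

```python
from typing import Dict, List, Tuple

# Paraphrased summaries of the three supported values.
TRIVIAL_SCALES_MODES = {
    0: "no optimization (default)",
    1: "drop scales equal to 1.0 in the FP8 casts",
    2: "mode 1, plus drop reciprocal scales in the FP8 GEMM",
}


def trivial_scales_env(mode: int) -> Dict[str, str]:
    """Validate `mode` and return the matching environment setting."""
    if mode not in TRIVIAL_SCALES_MODES:
        raise ValueError(f"unsupported mode {mode!r}; expected 0, 1, or 2")
    return {"PT_HPU_H2D_TRIVIAL_SCALES_MODE": str(mode)}


def scaled_cast(
    values: List[float], scale: float, skip_trivial: bool
) -> Tuple[List[float], int]:
    """Scale `values` and report how many multiplications were performed."""
    if skip_trivial and scale == 1.0:
        return list(values), 0  # trivial scale: no multiplication needed
    return [v * scale for v in values], len(values)
```

With `scale == 1.0`, both settings of `skip_trivial` return the same values, but the optimized path performs zero multiplications. A non-trivial scale takes the regular path in either case.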

Dynamic Quantization for MatMul and KV-cache Operations

This feature applies dynamic quantization to MatMul operations and KV-cache storage, improving performance with minimal expected impact on accuracy.

To enable the feature:

Set the environment variable:

```
export VLLM_DYNAMIC_KV_QUANT=1
```

Update your quantization configuration file with the following options:

```
"dynamic_quantization": "True",
"scale_format": "CONST"
```
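
The configuration update can also be scripted. The sketch below adds the two options to an existing quantization config without disturbing its other keys. The function names are hypothetical, and the key spellings `dynamic_quantization` and `scale_format` (with underscores) are an assumption about what INC expects; adjust them if your configuration schema differs.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Assumed key spellings; verify against your INC configuration schema.
DYNAMIC_QUANT_OPTIONS = {
    "dynamic_quantization": "True",
    "scale_format": "CONST",
}


def with_dynamic_quant(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with the dynamic quantization options set."""
    updated = dict(config)
    updated.update(DYNAMIC_QUANT_OPTIONS)
    return updated


def update_config_file(path: str) -> Dict[str, Any]:
    """Read a JSON config, apply the options, and write it back in place."""
    config_path = Path(path)
    config = with_dynamic_quant(json.loads(config_path.read_text()))
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

`with_dynamic_quant` is kept separate from the file handling so the merge can be checked on a plain dictionary before any file is rewritten.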

Single-Process Model Swap

This feature enables sequential serving of multiple small models from one API server process without restarting it. The implementation uses a dedicated Gaudi OpenAI server entrypoint together with an in-process V1 engine reconfigure path.

To enable the feature:

Create a YAML file with the model configurations. For example:

```
default_model: llama
models:
    llama:
        model: meta-llama/Llama-3.1-8B-Instruct
        max_model_len: 4096
        tensor_parallel_size: 1
    qwen:
        model: Qwen/Qwen3-0.6B
        max_model_len: 4096
        tensor_parallel_size: 1
```

Set the required environment variables:

```
export VLLM_SERVER_DEV_MODE=1
export VLLM_ALLOW_INSECURE_SERIALIZATION=1
export VLLM_HPU_MULTI_MODEL_CONFIG=/path/to/multi_models.yaml
```

Launch the dedicated OpenAI-compatible entrypoint:
```
python -m vllm_gaudi.entrypoints.openai.multi_model_api_server --port 8080
```
See the full online walkthrough in Single-Process Model Swap (Online Quickstart).

Verify the configured model aliases:

```
curl http://localhost:8080/v1/models | jq
```

Switch the active model in-process:

```
curl http://localhost:8080/v1/models/switch \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "drain_timeout": 60
  }' | jq
```
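
The same switch request can be sent from Python using only the standard library. The sketch below mirrors the `curl` call above: the endpoint path and the `model` and `drain_timeout` fields come from that example, while the function names, the default timeout, and the assumption that the response body is JSON are illustrative choices.

```python
import json
import urllib.request
from typing import Any, Dict


def build_switch_request(
    base_url: str, model: str, drain_timeout: int = 60
) -> urllib.request.Request:
    """Build the POST request that asks the server to activate `model`."""
    payload = json.dumps({"model": model, "drain_timeout": drain_timeout})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/models/switch",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def switch_model(
    base_url: str, model: str, drain_timeout: int = 60, timeout: float = 120.0
) -> Dict[str, Any]:
    """Send the switch request and return the decoded response.

    Assumes the server replies with a JSON body, as the `jq` pipe suggests.
    """
    request = build_switch_request(base_url, model, drain_timeout)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running server):
# switch_model("http://localhost:8080", "qwen", drain_timeout=60)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server. The client-side `timeout` should be longer than `drain_timeout`, since the switch may wait for in-flight requests to finish.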

Notes:

`/v1/models` shows all configured aliases from the YAML file.
Inference requests are served by the currently active model only.
`/v1/models/switch` is intentionally gated behind `VLLM_SERVER_DEV_MODE=1`.
`VLLM_ALLOW_INSECURE_SERIALIZATION=1` is required because the current in-process reconfigure path uses cloudpickle for internal config transfer. Enable this only for trusted/internal deployments.
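
When several aliases are managed programmatically, the multi-model YAML file from the first step can be generated instead of written by hand. The sketch below renders the same layout with plain string formatting, so it needs no YAML library. The function name is hypothetical, it handles only simple scalar values, and the check that `default_model` must be one of the aliases is an assumption rather than a documented rule.

```python
from typing import Any, Dict


def render_multi_model_yaml(
    default_model: str, models: Dict[str, Dict[str, Any]]
) -> str:
    """Render a multi-model config with 4-space indentation.

    Only plain scalar values (strings, integers) are supported; values that
    need YAML quoting or nesting are out of scope for this sketch.
    """
    if default_model not in models:
        # Assumed sanity check: the default should name a configured alias.
        raise ValueError(f"default_model {default_model!r} is not configured")
    lines = [f"default_model: {default_model}", "models:"]
    for alias, options in models.items():
        lines.append(f"    {alias}:")
        for key, value in options.items():
            lines.append(f"        {key}: {value}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to the path referenced by `VLLM_HPU_MULTI_MODEL_CONFIG` yields a file equivalent to the hand-written example.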

Planned Features

Future plugin releases are planned to provide support for the following vLLM features:

Sliding window attention
Prefill/decode (P/D) disaggregation support
In-place weight update
Multinode support
Pipeline parallel inference

Discontinued Features

| Feature | Description | Reason |
| --- | --- | --- |
| Multi-step scheduling | Multi-step scheduling support for host overhead reduction. | Replaced by async scheduling, configurable via the `--async_scheduling` parameter. |
| Delayed Sampling | Support for delayed sampling scheduling for asynchronous execution. | Replaced by async scheduling, configurable via the `--async_scheduling` parameter. |