Hidden State Extraction

The Hidden State Extraction feature allows vLLM to save intermediate layer activations from a target model during inference. This is useful for training EAGLE-style draft models, knowledge distillation, or offline analysis of model internals.

Note

To save the last layer's output hidden states, pass num_hidden_layers as a layer id. These states are saved as-is and are not normalized by the model's output norm.
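If you need normalized last-layer states, you would have to apply the output norm yourself. The sketch below shows what that could look like for a single hidden-state vector, assuming the model's output norm is an RMSNorm with a learned weight. `rms_norm` is a hypothetical helper, not a vLLM function, and the weight and `eps` values here are placeholders; the real ones come from the model's checkpoint and config.

```python
import math


def rms_norm(vector, weight, eps=1e-6):
    """Apply RMSNorm to one hidden-state vector given as a list of floats.

    Assumption: the output norm is y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
    """
    mean_square = sum(x * x for x in vector) / len(vector)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [x * scale * w for x, w in zip(vector, weight)]


# Placeholder values for illustration only.
normalized = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

In practice the same formula would be applied with a tensor library over the whole [num_tokens, hidden_size] slice for the last layer.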

Offline Example

import tempfile

from vllm import LLM, SamplingParams
from vllm.config.kv_transfer import KVTransferConfig
from vllm.distributed.kv_transfer.kv_connector.v1 import (
    example_hidden_states_connector,
)

with tempfile.TemporaryDirectory() as tmpdir:
    llm = LLM(
        model="Qwen/Qwen3-8B",
        speculative_config={
            "method": "extract_hidden_states",
            "num_speculative_tokens": 1,
            "draft_model_config": {
                "hf_config": {
                    "eagle_aux_hidden_state_layer_ids": [1, 2, 3, 4],
                },
            },
        },
        kv_transfer_config=KVTransferConfig(
            kv_connector="ExampleHiddenStatesConnector",
            kv_role="kv_producer",
            kv_connector_extra_config={
                "shared_storage_path": tmpdir,
            },
        ),
    )

    outputs = llm.generate(
        ["The future of AI is"],
        SamplingParams(max_tokens=1),
    )

    for output in outputs:
        path = output.kv_transfer_params["hidden_states_path"]
        obj = example_hidden_states_connector.load_hidden_states(path)
        print(f"token_ids: {obj['token_ids'].shape}")
        print(f"hidden_states: {obj['hidden_states'].shape}")

A complete example is available at examples/features/speculative_decoding/extract_hidden_states_offline.py.

Online Example

For better performance in online usage, point shared_storage_path at a RAM-backed file system such as /dev/shm/, and have the client delete each file soon after it is generated.

vllm serve Qwen/Qwen3-8B \
    --speculative_config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [1, 2, 3, 4]}}}' \
    --kv_transfer_config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"shared_storage_path": "/dev/shm/hidden_states"}}'
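One way a client might handle cleanup is to read each file and delete it immediately afterwards. This is a minimal sketch of client-side handling, not part of vLLM: `consume_and_delete` is a hypothetical helper, and the stand-in loader below only reads raw bytes so the example runs without vLLM installed.

```python
import os
import tempfile


def consume_and_delete(path, loader):
    """Load one hidden-states file with `loader`, then remove it from disk.

    The file is deleted only after a successful load, so a failed read
    leaves it in place and can be retried.
    """
    result = loader(path)
    os.remove(path)
    return result


def read_bytes(path):
    """Stand-in loader used only for this demonstration."""
    with open(path, "rb") as handle:
        return handle.read()


# Create a placeholder file to stand in for a saved hidden-states file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".safetensors") as tmp:
    tmp.write(b"placeholder")
    demo_path = tmp.name

payload = consume_and_delete(demo_path, read_bytes)
```

With vLLM available, you would pass load_hidden_states from the connector module as the loader instead of `read_bytes`.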

Per-Request Options

Both offline and online modes support per-request options via kv_transfer_params:

| Parameter | Default | Description |
| --- | --- | --- |
| `hidden_states_path` | Auto-generated | Custom file path for saving hidden states. If not set, files are saved to `<shared_storage_path>/<request_id>.safetensors`. Requires `allow_custom_save_path` to be enabled in the server config. |
| `include_output_tokens` | `False` | When `True`, save hidden states for both prompt and generated output tokens. When `False`, only prompt token hidden states are saved. |
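The path rules in this table, combined with the `allow_custom_save_path` server option, can be summarized in a few lines. The function below is an illustrative model of the documented behavior, not the connector's actual implementation; `resolve_save_path` and its argument names are hypothetical.

```python
import posixpath
import warnings


def resolve_save_path(request_id, shared_storage_path="/tmp",
                      requested_path=None, allow_custom_save_path=False):
    """Return the path a request's hidden states would be saved to."""
    default_path = posixpath.join(shared_storage_path, f"{request_id}.safetensors")
    if requested_path is None:
        # No per-request path: fall back to the auto-generated location.
        return default_path
    if not allow_custom_save_path:
        # Custom paths are disabled on the server: ignore the request's path.
        warnings.warn("hidden_states_path ignored: allow_custom_save_path is disabled")
        return default_path
    return requested_path


auto_path = resolve_save_path("req-1")
custom_path = resolve_save_path(
    "req-2",
    requested_path="/tmp/my_output.safetensors",
    allow_custom_save_path=True,
)
```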

Offline usage

Pass per-request options via extra_args on SamplingParams:

SamplingParams(
    max_tokens=32,
    extra_args={
        "kv_transfer_params": {
            "hidden_states_path": "/tmp/my_output.safetensors",
            "include_output_tokens": True,
        }
    },
)

Online usage

Pass kv_transfer_params as a top-level field in the API request:

{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
    "kv_transfer_params": {
        "hidden_states_path": "/tmp/my_output.safetensors",
        "include_output_tokens": true
    }
}
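A request body like the one above could be built and sent from Python using only the standard library. This is a sketch under assumptions: the server is taken to listen on http://localhost:8000 and to expose the OpenAI-compatible /v1/chat/completions route, and `build_chat_request` and `post_chat_request` are hypothetical helper names. Where the saved file path appears in the response is not shown here.

```python
import json
import urllib.request


def build_chat_request(prompt, hidden_states_path=None,
                       include_output_tokens=False,
                       model="Qwen/Qwen3-8B", max_tokens=32):
    """Build a chat request body with kv_transfer_params as a top-level field."""
    params = {"include_output_tokens": include_output_tokens}
    if hidden_states_path is not None:
        # Only honored when allow_custom_save_path is enabled on the server.
        params["hidden_states_path"] = hidden_states_path
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "kv_transfer_params": params,
    }


def post_chat_request(payload, base_url="http://localhost:8000"):
    """Send the request (assumed URL and route) and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Building the payload needs no running server; sending it does.
payload = build_chat_request("Hello", include_output_tokens=True)
```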

Configuration

The kv_connector_extra_config dict accepts these server-level options:

| Parameter | Default | Description |
| --- | --- | --- |
| `shared_storage_path` | `/tmp` | Directory where hidden state files are saved (used when `hidden_states_path` is not set per-request). |
| `allow_custom_save_path` | `False` | Allow API clients to specify custom file paths via `hidden_states_path`. When disabled, client-provided paths are ignored with a warning. Enable only with trusted clients: custom paths can write to arbitrary locations on the server. |
| `num_writer_threads` | `8` | Thread pool size for async disk writes. |
| `use_synchronization_lock` | `True` | Use file locks so concurrent readers block until writes complete. Can be disabled for batch generation where synchronization is not needed. |
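Because these options are passed as nested JSON on the command line, it can be less error-prone to generate the argument than to type it by hand. The sketch below builds a shell-quoted value for --kv_transfer_config; `kv_transfer_config_arg` is a hypothetical helper, and the values chosen (16 writer threads, lock disabled) are illustrative rather than recommended settings.

```python
import json
import shlex


def kv_transfer_config_arg(shared_storage_path, **extra_config):
    """Return a shell-quoted JSON string for the --kv_transfer_config flag."""
    config = {
        "kv_connector": "ExampleHiddenStatesConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
            "shared_storage_path": shared_storage_path,
            **extra_config,
        },
    }
    return shlex.quote(json.dumps(config))


# Illustrative values for a batch-generation setup.
arg = kv_transfer_config_arg(
    "/dev/shm/hidden_states",
    num_writer_threads=16,
    use_synchronization_lock=False,
)
print(f"--kv_transfer_config {arg}")
```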

Output Format

Each request produces a .safetensors file containing:

hidden_states — shape [num_tokens, num_extracted_layers, hidden_size]
token_ids — shape [num_tokens]
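These shapes make it easy to estimate how much space a request will need, which matters when writing to a RAM-backed directory. The sketch below is a back-of-the-envelope calculation: `expected_output` is a hypothetical helper, the hidden size of 4096 is an illustrative value, and 2 bytes per element assumes a 16-bit dtype. It ignores the token_ids tensor and the file header.

```python
def expected_output(num_tokens, layer_ids, hidden_size, bytes_per_element=2):
    """Return the expected tensor shapes and the hidden_states payload size."""
    num_extracted_layers = len(layer_ids)
    hidden_shape = (num_tokens, num_extracted_layers, hidden_size)
    token_shape = (num_tokens,)
    hidden_bytes = num_tokens * num_extracted_layers * hidden_size * bytes_per_element
    return hidden_shape, token_shape, hidden_bytes


# 1000 tokens, four extracted layers, illustrative hidden size of 4096.
hidden_shape, token_shape, hidden_bytes = expected_output(1000, [1, 2, 3, 4], 4096)
print(hidden_shape, token_shape, hidden_bytes)
# (1000, 4, 4096) (1000,) 32768000
```

Under these assumptions a 1000-token request takes roughly 33 MB for hidden_states alone, so many in-flight requests can fill a small /dev/shm quickly if files are not cleaned up.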

The file path is returned in output.kv_transfer_params["hidden_states_path"]. Use load_hidden_states() from the connector module to read the file with proper synchronization.

Note

Chunked prefill is not compatible with this feature and must be disabled.