Reasoning Outputs¶

vLLM offers support for reasoning models like DeepSeek R1, which are designed to generate outputs containing both reasoning steps and final conclusions.

Reasoning models return an additional reasoning field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.

Warning

reasoning used to be called reasoning_content. To migrate, directly replace reasoning_content with reasoning.

Supported Models¶

vLLM currently supports the following reasoning models:

Model Series	Parser Name	Structured Output Support	Tool Calling
Cohere Command A Reasoning	`cohere_command3`	`json`, `regex`	✅
DeepSeek R1 series	`deepseek_r1`	`json`, `regex`	❌
Gemma 4 series	`gemma4`	`json`, `regex`	✅
DeepSeek-V3.1	`deepseek_v3`	`json`, `regex`	❌
ERNIE-4.5-VL series	`ernie45`	`json`, `regex`	❌
ERNIE-4.5-21B-A3B-Thinking	`ernie45`	`json`, `regex`	✅
GLM-4.5 series	`glm45`	`json`, `regex`	✅
Holo2 series	`holo2`	`json`, `regex`	✅
Hunyuan A13B series	`hunyuan_a13b`	`json`, `regex`	✅
IBM Granite 3.2 language models	`granite`	❌	❌
MiniMax-M2	`minimax_m2_append_think`	`json`, `regex`	✅
Qwen3 series	`qwen3`	`json`, `regex`	✅
QwQ-32B	`deepseek_r1`	`json`, `regex`	✅

Note

IBM Granite 3.2 and DeepSeek-V3.1 reasoning is disabled by default; to enable it, you must also pass thinking=True in your chat_template_kwargs. The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass enable_thinking=False in your chat_template_kwargs. Gemma 4 reasoning is disabled by default; to enable it, pass enable_thinking=True in your chat_template_kwargs or set reasoning_effort (which enables it automatically). DeepSeek-V3.1 tool calling is supported in non-thinking mode. Holo2 reasoning is enabled by default. To disable it, you must also pass thinking=False in your chat_template_kwargs.

Quickstart¶

To use reasoning models, you need to specify the --reasoning-parser flags when making a request to the chat completion endpoint. The --reasoning-parser flag specifies the reasoning parser to use for extracting reasoning content from the model output.

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --reasoning-parser deepseek_r1

Next, make a request to the model that should return the reasoning content in the response.

Code

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
response = client.chat.completions.create(model=model, messages=messages)

reasoning = response.choices[0].message.reasoning
content = response.choices[0].message.content

print("reasoning:", reasoning)
print("content:", content)

The reasoning field contains the reasoning steps that led to the final conclusion, while the content field contains the final conclusion.

Streaming chat completions¶

Streaming chat completions are also supported for reasoning models. The reasoning field is available in the delta field in chat completion response chunks.

Json

{
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",
    "created": 1694268190,
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "system_fingerprint": "fp_44709d6fcb",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "reasoning": "is",
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}

OpenAI Python client library does not officially support reasoning attribute for streaming output. But the client supports extra attributes in the response. You can use hasattr to check if the reasoning attribute is present in the response. For example:

Code

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
)

print("client: Start streaming chat completions...")
printed_reasoning = False
printed_content = False

for chunk in stream:
    # Safely extract reasoning and content from delta,
    # defaulting to None if attributes don't exist or are empty strings
    reasoning = (
        getattr(chunk.choices[0].delta, "reasoning", None) or None
    )
    content = getattr(chunk.choices[0].delta, "content", None) or None

    if reasoning is not None:
        if not printed_reasoning:
            printed_reasoning = True
            print("reasoning:", end="", flush=True)
        print(reasoning, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        # Extract and print the content
        print(content, end="", flush=True)

Remember to check whether the reasoning exists in the response before accessing it. You could check out the example.

Tool Calling¶

The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the content field, not from the reasoning.

Code

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location", "unit"],
            }
        },
    }
]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto",
)

print(response)
tool_call = response.choices[0].message.tool_calls[0].function

print(f"reasoning: {response.choices[0].message.reasoning}")
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")

For more examples, please refer to examples/reasoning/openai_chat_completion_tool_calls_with_reasoning.py.

Server-Level Default Chat Template Kwargs¶

You can set default chat_template_kwargs at the server level using the --default-chat-template-kwargs CLI argument. This is useful for configuring reasoning behavior across all requests without requiring clients to specify it in each request.

Disabling Thinking Mode by Default¶

For models like Qwen3 where thinking is enabled by default, you can disable it server-wide:

vllm serve Qwen/Qwen3-8B \
    --reasoning-parser qwen3 \
    --default-chat-template-kwargs '{"enable_thinking": false}'

Enabling Thinking Mode by Default¶

For models like IBM Granite 3.2 or DeepSeek-V3.1 where thinking is disabled by default, you can enable it server-wide:

vllm serve ibm-granite/granite-3.2-2b-instruct \
    --reasoning-parser granite \
    --default-chat-template-kwargs '{"thinking": true}'

Request-Level Override¶

Request-level chat_template_kwargs always take priority over server defaults. For example, if the server is started with enable_thinking=false, a client can still enable it for a specific request:

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}  # Overrides server default
)

Thinking Budget Control¶

Some models, such as Qwen3, DeepSeek, and Nemotron3, support a thinking budget that limits the maximum number of tokens used for reasoning.

Token counting starts from reasoning_start_str. Once the reasoning token count reaches the configured thinking_token_budget, vLLM forces the model to produce reasoning_end_str, effectively terminating the reasoning block.

To use this feature:

--reasoning-parser enables reasoning extraction.
--reasoning-config defines the reasoning boundary tokens (e.g., reasoning_start_str, reasoning_end_str). If not set, vLLM will attempt to automatically initialize these tokens from the reasoning parser.
thinking_token_budget (a sampling parameter) sets the per-request reasoning token limit.

If thinking_token_budget is not specified, no explicit reasoning limit is applied beyond normal generation constraints such as max_tokens.

--reasoning-config accepts a JSON object corresponding to
ReasoningConfig with the following fields:

Field	Type	Description
`reasoning_start_str`	`str \\| null`	String that marks the start of reasoning content
`reasoning_end_str`	`str \\| null`	String that marks the end of reasoning content

Note

reasoning_end_str can include a transition phrase before the reasoning end token. For example, setting reasoning_end_str to "I have to give the solution based on the reasoning directly now.</think>" instructs the model to emit that phrase when the budget is exhausted, making the reasoning termination more natural.

Online Serving¶

vllm serve Qwen/Qwen3-0.6B \
    --reasoning-parser qwen3 \
    --reasoning-config '{"reasoning_start_str": "<think>", "reasoning_end_str": "I have to give the solution based on the reasoning directly now.</think>"}'

Then make a request with thinking_token_budget to limit the reasoning tokens:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      { "role": "user", "content": "9.11 and 9.8, which is greater?" }
    ],
    "thinking_token_budget": 10
  }'

Offline Inference¶

from vllm import LLM, SamplingParams
from vllm.config import ReasoningConfig

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    reasoning_config=ReasoningConfig(
        reasoning_start_str="<think>",
        reasoning_end_str="I have to give the solution based on the thinking directly now.</think>",
    ),
)

sampling_params = SamplingParams(thinking_token_budget=10)

messages = [
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]

outputs = llm.chat(messages, sampling_params=sampling_params)

for output in outputs:
    print("text:", output.outputs[0].text)

Automatic `enable_thinking` Activation¶

Some models (such as Gemma 4, DeepSeek-V4-Pro and IBM Granite 3.2) require enable_thinking: true in their chat template kwargs to activate thinking mode — without it, reasoning tokens are never generated regardless of other settings.

When you set reasoning_effort in a Chat Completions request (or reasoning.effort in a Responses API request), vLLM automatically injects enable_thinking into the chat template kwargs:

reasoning_effort = "low", "medium", or "high" → enable_thinking = true
reasoning_effort = "none" → enable_thinking = false
reasoning_effort not set → enable_thinking is not injected (preserves existing behavior)

This means you no longer need to manually pass chat_template_kwargs: {"enable_thinking": true} when using reasoning_effort — it is handled automatically.

Note

If you explicitly set enable_thinking in chat_template_kwargs, your value takes priority over the automatic injection. This allows you to override the behavior if needed.

For models whose templates don't declare enable_thinking (e.g., DeepSeek R1), the injected kwarg is harmlessly filtered out by resolve_chat_template_kwargs.

Example¶

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# reasoning_effort automatically enables thinking for models that need it
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What is 15 * 37?"}],
    reasoning_effort="high",  # Automatically sets enable_thinking=true
)

print(response.choices[0].message.reasoning)
print(response.choices[0].message.content)

Limitations¶

The reasoning content is only available for online serving's chat completion endpoint (/v1/chat/completions), Anthropic Messages API (/v1/messages) and the Responses API (/v1/responses).

How to support a new reasoning model¶

You can add a new ReasoningParser similar to vllm/reasoning/deepseek_r1_reasoning_parser.py.

Code

# import the required packages

from vllm.reasoning import ReasoningParser, ReasoningParserManager
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
from vllm.entrypoints.openai.engine.protocol import DeltaMessage

# define a reasoning parser and register it to vllm
# the name list in register_module can be used
# in --reasoning-parser.
class ExampleParser(ReasoningParser):
    def __init__(self, tokenizer: TokenizerLike):
        super().__init__(tokenizer)

    def extract_reasoning_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> DeltaMessage | None:
        """
        Instance method that should be implemented for extracting reasoning
        from an incomplete response; for use when handling reasoning calls and
        streaming. Has to be an instance method because  it requires state -
        the current tokens/diffs, but also the information about what has
        previously been parsed and extracted (see constructor)
        """

    def extract_reasoning(
        self,
        model_output: str,
        request: ChatCompletionRequest | ResponsesRequest,
    ) -> tuple[str | None, str | None]:
        """
        Extract reasoning content from a complete model-generated string.

        Used for non-streaming responses where we have the entire model response
        available before sending to the client.

        Parameters:
        model_output: str
            The model-generated string to extract reasoning content from.

        request: ChatCompletionRequest
            The request object that was used to generate the model_output.

        Returns:
        tuple[Optional[str], Optional[str]]
            A tuple containing the reasoning content and the content.
        """
# Register the reasoning parser
ReasoningParserManager.register_lazy_module(
    name="example",
    module_path="vllm.reasoning.example_reasoning_parser",
    class_name="ExampleParser",
)

Additionally, to enable structured output, you'll need to create a new Reasoner similar to the one in vllm/reasoning/deepseek_r1_reasoning_parser.py.

Code

@dataclass
class DeepSeekReasoner(Reasoner):
    """
    Reasoner for DeepSeek R series models.
    """
    start_token_id: int
    end_token_id: int

    start_token: str = "<think>"
    end_token: str = "</think>"

    @classmethod
    def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner:
        return cls(
            start_token_id=tokenizer.encode("<think>", add_special_tokens=False)[0],
            end_token_id=tokenizer.encode("</think>", add_special_tokens=False)[0],
        )

    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        return self.end_token_id in input_ids

    def is_reasoning_end_streaming(self, input_ids: list[int], delta_ids: list[int]) -> bool:
        return self.end_token_id in delta_token_ids
    ...

The structured output engine like xgrammar will use end_token_id to check if the reasoning content is present in the model output and skip the structured output if it is the case.

Finally, you can enable reasoning for the model by using the --reasoning-parser flags.

vllm serve <model_tag> --reasoning-parser example

Reasoning Outputs¶

Supported Models¶

Quickstart¶

Streaming chat completions¶

Tool Calling¶

Server-Level Default Chat Template Kwargs¶

Disabling Thinking Mode by Default¶

Enabling Thinking Mode by Default¶

Request-Level Override¶

Thinking Budget Control¶

Online Serving¶

Offline Inference¶

Automatic enable_thinking Activation¶

Example¶

Limitations¶

How to support a new reasoning model¶

Automatic `enable_thinking` Activation¶