Using VLMs

vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM.

Engine Arguments

The following engine arguments are specific to VLMs:

usage: -m vllm.entrypoints.openai.api_server [-h]
                                             [--image-input-type {pixel_values,image_features}]
                                             [--image-token-id IMAGE_TOKEN_ID]
                                             [--image-input-shape IMAGE_INPUT_SHAPE]
                                             [--image-feature-size IMAGE_FEATURE_SIZE]
                                             [--image-processor IMAGE_PROCESSOR]
                                             [--image-processor-revision IMAGE_PROCESSOR_REVISION]

Named Arguments

--image-input-type

Possible choices: pixel_values, image_features

The image input type passed into vLLM.

--image-token-id

Input id for image token.

--image-input-shape

The biggest image input shape (worst for memory footprint) given an input type. Only used for vLLM’s profile_run.

--image-feature-size

The image feature size along the context dimension.

--image-processor

Name or path of the Hugging Face image processor to use. If unspecified, the model name or path will be used.

--image-processor-revision

Revision of the Hugging Face image processor version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version will be used.

An additional option disables the use of the image processor, even if one is defined for the model on Hugging Face.
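
The values used for llava-hf/llava-1.5-7b-hf later in this document (an input shape of 1,3,336,336 and a feature size of 576) are consistent with a vision encoder that splits the image into square patches and emits one feature per patch. The sketch below reproduces that arithmetic. The patch size of 14 is an assumption about that model's CLIP ViT-L/14 encoder, not a value vLLM reports, and `image_feature_size` here is a hypothetical helper, not part of vLLM.

```python
def image_feature_size(height: int, width: int, patch_size: int) -> int:
    """Number of patch features for an image split into square patches.

    Hypothetical helper for illustration; not a vLLM API.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


# 336 / 14 = 24 patches per side, so 24 * 24 = 576 features.
# patch_size=14 is an assumption about the LLaVA-1.5 vision encoder.
print(image_feature_size(336, 336, patch_size=14))  # prints 576
```

Other models may add special tokens or use a different patch layout, so treat the result as a sanity check rather than a rule.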


Currently, the support for vision language models on vLLM has the following limitations:

  • Only single image input is supported per text prompt.

  • Dynamic image_input_shape is not supported: the input image is resized to the static image_input_shape. As a result, LLaVA-NeXT output from vLLM may not exactly match the Hugging Face implementation.
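
To see whether a given image will be affected by the static-shape limitation, it can help to compare its size with the configured shape before sending it. The following is a minimal sketch, assuming the shape string is laid out as batch,channels,height,width (as in 1,3,336,336 for pixel_values). The names `ImageInputShape`, `parse_image_input_shape`, and `will_be_resized` are hypothetical and not part of vLLM.

```python
from typing import NamedTuple, Tuple


class ImageInputShape(NamedTuple):
    # Assumed layout for pixel_values: batch, channels, height, width.
    batch: int
    channels: int
    height: int
    width: int


def parse_image_input_shape(spec: str) -> ImageInputShape:
    """Parse a comma-separated shape string such as "1,3,336,336"."""
    parts = [int(p) for p in spec.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated integers, got {spec!r}")
    return ImageInputShape(*parts)


def will_be_resized(image_size: Tuple[int, int], shape: ImageInputShape) -> bool:
    """Return True if an image of (width, height) differs from the static shape.

    (width, height) is the order used by PIL's Image.size.
    """
    width, height = image_size
    return (height, width) != (shape.height, shape.width)


shape = parse_image_input_shape("1,3,336,336")
print(will_be_resized((640, 480), shape))  # prints True
print(will_be_resized((336, 336), shape))  # prints False
```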

We are continuously improving the user and developer experience for VLMs. Please open an issue on GitHub if you have any feedback or feature requests.

Offline Batched Inference

To initialize a VLM, the aforementioned arguments must be passed to the LLM class for instantiating the engine.

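from vllm import LLM

# Values mirror the llava-hf/llava-1.5-7b-hf server example later in this document.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)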

To pass an image to the model, note the following in vllm.inputs.PromptStrictInputs:

  • prompt: The prompt should have a number of <image> tokens equal to image_feature_size.

  • multi_modal_data: This should be an instance of ImagePixelData or ImageFeatureData.

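from vllm.multimodal.image import ImagePixelData

# One <image> token per image feature (image_feature_size is 576 here).
prompt = "<image>" * 576 + (
    "\nUSER: What is the content of this image?\nASSISTANT:")

# Load the image using PIL.Image
image = ...

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": ImagePixelData(image),
})

# Print the first completion for each request.
for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)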
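
The prompt requirement described above (one <image> token per image feature) is easy to get wrong when the feature size changes between models. A small helper can keep the two in sync. This is a sketch only: `build_vlm_prompt` is a hypothetical name, and the USER/ASSISTANT wording simply copies the LLaVA-style prompt shown in this section.

```python
IMAGE_TOKEN = "<image>"


def build_vlm_prompt(question: str, image_feature_size: int) -> str:
    """Prefix a question with exactly image_feature_size image tokens.

    Hypothetical helper; the USER/ASSISTANT layout follows the example prompt
    in this section and may differ for other models.
    """
    if image_feature_size <= 0:
        raise ValueError("image_feature_size must be a positive integer")
    return IMAGE_TOKEN * image_feature_size + f"\nUSER: {question}\nASSISTANT:"


prompt = build_vlm_prompt("What is the content of this image?", 576)
# The token count must match the image_feature_size passed to the engine.
print(prompt.count(IMAGE_TOKEN))  # prints 576
```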

A code example can be found in examples/

Online OpenAI Vision API Compatible Inference

You can serve vision language models with vLLM’s HTTP server, which is compatible with the OpenAI Vision API.


Currently, vLLM supports only a single image_url input per message. Support for multi-image inputs will be added in the future.

Below is an example of how to launch the same llava-hf/llava-1.5-7b-hf model with the vLLM API server.


Since the OpenAI Vision API is based on the Chat API, a chat template is required to launch the API server if the model’s tokenizer does not come with one. This example uses the Hugging Face LLaVA chat template (template_llava.jinja), which is available in the vLLM examples folder.

python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf \
    --image-input-type pixel_values \
    --image-token-id 32000 \
    --image-input-shape 1,3,336,336 \
    --image-feature-size 576 \
    --chat-template template_llava.jinja

To consume the server, you can use the OpenAI client as in the example below:

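from openai import OpenAI

# The local vLLM server does not check the key, but the client requires one.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                # Set this to the URL of the image to describe.
                "image_url": {"url": ""},
            },
        ],
    }],
)
print("Chat response:", chat_response)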
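
Because only one image_url entry is accepted per message, it can be convenient to build the payload through a helper that produces exactly one image part and to check hand-built payloads before sending them. The sketch below is one way to do that, using the message layout from the client example. `build_vision_messages` and `count_image_urls` are hypothetical names, and the example.com URL is a placeholder.

```python
from typing import Any, Dict, List


def build_vision_messages(text: str, image_url: str) -> List[Dict[str, Any]]:
    """Build a single user message with one text part and one image part."""
    if not image_url:
        raise ValueError("image_url must be a non-empty string")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]


def count_image_urls(messages: List[Dict[str, Any]]) -> int:
    """Count image_url parts across messages with list-style content."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


# Placeholder URL for illustration only.
messages = build_vision_messages("What's in this image?", "https://example.com/cat.jpg")
print(count_image_urls(messages))  # prints 1
```

The resulting list can be passed as the messages argument of client.chat.completions.create.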


By default, the timeout for fetching images through an HTTP URL is 5 seconds. You can override this default by setting an environment variable.



The prompt formatting with the image token <image> is not needed when serving VLMs with the API server since the prompt will be processed automatically by the server.