Quickstart

This guide will help you quickly get started with vLLM-Omni to perform:

  • Offline batched inference
  • Online serving using an OpenAI-compatible server

Prerequisites

  • OS: Linux
  • Python: 3.12

Installation

For installation on GPU from source:

uv venv --python 3.12 --seed
source .venv/bin/activate

# On CUDA
uv pip install vllm==0.21.0 --torch-backend=auto

# On ROCm
uv pip install vllm==0.21.0+rocm721 --extra-index-url https://wheels.vllm.ai/rocm/0.21.0/rocm721

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .

For additional installation methods, please see the installation guide.

Note

Install matching major and minor versions of vLLM and vLLM-Omni; otherwise, things may not work as expected. If the versions are misaligned, you will see a warning when you import vLLM-Omni.

If the vllm command does not handle the --omni flag correctly, you most likely have a version mismatch between vLLM < 0.21.0 and vLLM-Omni 0.21.0, because vLLM-Omni no longer hijacks the vLLM entrypoint. Updating vLLM should resolve this issue.
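The version alignment described in the note above can also be checked by hand. The following is a minimal sketch, not part of vLLM-Omni: the helper names `major_minor` and `versions_aligned` are hypothetical, and the distribution names `vllm` and `vllm-omni` passed to `importlib.metadata.version` are assumptions about how the packages are registered in your environment.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_string: str) -> tuple[int, int]:
    """Extract (major, minor) from a version such as '0.21.0+rocm721'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def versions_aligned(vllm_version: str, omni_version: str) -> bool:
    """Return True when both versions share the same major and minor numbers."""
    return major_minor(vllm_version) == major_minor(omni_version)


if __name__ == "__main__":
    try:
        # Assumed distribution names; adjust if your environment differs.
        vllm_version = version("vllm")
        omni_version = version("vllm-omni")
    except PackageNotFoundError as exc:
        print(f"Package not installed: {exc}")
    else:
        status = "aligned" if versions_aligned(vllm_version, omni_version) else "MISALIGNED"
        print(f"vllm {vllm_version} / vllm-omni {omni_version}: {status}")
```

Only the major and minor components are compared, so a local suffix such as `+rocm721` or a differing patch number does not count as a mismatch.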

Offline Inference

Text-to-image generation quickstart with vLLM-Omni:

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Tongyi-MAI/Z-Image-Turbo")
    prompt = "a cup of coffee on the table"
    outputs = omni.generate(prompt)
    images = outputs[0].request_output.images
    images[0].save("coffee.png")

You can also pass a list of prompts and wait for all of them to finish processing, as shown below.

Info

Passing a list of prompts is not currently recommended: not all models support batch inference, and batched requests usually do not improve performance significantly, despite the impression that they might. The feature exists primarily for interface compatibility with vLLM and to allow for future improvements.

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(
        model="Tongyi-MAI/Z-Image-Turbo",
        # stage_configs_path="./stage-config.yaml",  # See below
    )
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    omni_outputs = omni.generate(prompts)
    for i_prompt, prompt_output in enumerate(omni_outputs):
        this_request_output = prompt_output.request_output
        this_images = this_request_output.images
        for i_image, image in enumerate(this_images):
            image.save(f"p{i_prompt}-img{i_image}.jpg")
            print("saved to", f"p{i_prompt}-img{i_image}.jpg")
            # saved to p0-img0.jpg
            # saved to p1-img0.jpg
            # saved to p2-img0.jpg

Info

For diffusion pipelines, the stage config field stage_args.[].engine_args.max_num_seqs is 1 by default, and the input list is split into single-item requests before being fed into the diffusion pipeline. For models that internally support batched inputs, you can raise this value to let the model accept a larger batch of prompts.
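To make the field path above concrete, here is a minimal sketch that raises `max_num_seqs` on a stage config held as a Python dictionary. The helper `with_max_num_seqs` is hypothetical, and the example dictionary is heavily trimmed: it only mirrors the `stage_args.[].engine_args.max_num_seqs` path named above, while a real stage config contains other fields not shown here.

```python
import copy
import json


def with_max_num_seqs(stage_config: dict, max_num_seqs: int) -> dict:
    """Return a copy of stage_config with max_num_seqs set on every stage."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be at least 1")
    updated = copy.deepcopy(stage_config)  # leave the caller's config untouched
    for stage in updated.get("stage_args", []):
        stage.setdefault("engine_args", {})["max_num_seqs"] = max_num_seqs
    return updated


# Hypothetical, trimmed config that only mirrors the field path from the text.
example = {"stage_args": [{"engine_args": {"max_num_seqs": 1}}]}
batched = with_max_num_seqs(example, 4)
print(json.dumps(batched, indent=2))
```

This sketch applies the same value to every stage. For a multi-stage pipeline you would likely want to change only the stages whose models support batched inputs.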

For more usage examples, please refer to offline inference.

Online Serving with OpenAI-Completions API

Text-to-image generation quickstart with vLLM-Omni:

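# Start the server (this command blocks, so run it in its own terminal)
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# In a second terminal, send a request and save the returned image
curl -s http://localhost:8091/v1/chat/completions \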
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "a cup of coffee on the table"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 50,
      "guidance_scale": 4.0,
      "seed": 42
    }
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2 | base64 -d > coffee.png
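The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command above: it assumes the server is running on localhost:8091 and that the response carries the image as a base64 data URL at `choices[0].message.content[0].image_url.url`, as the jq filter implies. The helper names `build_payload`, `decode_data_url`, and `generate_image` are hypothetical.

```python
import base64
import json
import urllib.request


def build_payload(prompt: str, **image_options) -> dict:
    """Build the request body in the same shape as the curl example."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": image_options,
    }


def decode_data_url(data_url: str) -> bytes:
    """Decode a 'data:<mime>;base64,<payload>' URL into raw bytes."""
    _, encoded = data_url.split(",", 1)
    return base64.b64decode(encoded)


def generate_image(prompt: str, base_url: str = "http://localhost:8091", **image_options) -> bytes:
    """POST the prompt to the chat completions endpoint and return image bytes."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, **image_options)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed response layout, taken from the jq filter in the curl example.
    data_url = body["choices"][0]["message"]["content"][0]["image_url"]["url"]
    return decode_data_url(data_url)


if __name__ == "__main__":
    try:
        image_bytes = generate_image(
            "a cup of coffee on the table",
            height=1024,
            width=1024,
            num_inference_steps=50,
            guidance_scale=4.0,
            seed=42,
        )
    except OSError as exc:  # covers connection errors and HTTP error responses
        print(f"Request failed: {exc}")
    else:
        with open("coffee.png", "wb") as image_file:
            image_file.write(image_bytes)
```

Keeping payload construction and data-URL decoding in separate functions lets you check them without a running server.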

For more details, please refer to online serving.