Skip to content

Welcome to vLLM-Omni

vllm-omni

Easy, fast, and cheap omni-modality model serving for everyone

Star Watch Fork

About

vLLM was originally designed to support large language models for text-based autoregressive generation tasks. vLLM-Omni is a framework that extends its support for omni-modality model inference and serving:

  • Omni-modality: Text, image, video, and audio data processing
  • Non-autoregressive Architectures: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models
  • Heterogeneous outputs: from traditional text generation to multimodal outputs

vllm-omni-arch

vLLM-Omni is fast with:

  • State-of-the-art AR support by leveraging efficient KV cache management from vLLM
  • Pipelined stage execution overlapping for high throughput performance
  • Fully disaggregation based on OmniConnector and dynamic resource allocation across stages

vLLM-Omni is flexible and easy to use with:

  • Heterogeneous pipeline abstraction to manage complex model workflows
  • Seamless integration with popular Hugging Face models
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server

vLLM-Omni seamlessly supports most popular open-source models on HuggingFace, including:

  • Omni-modality models (e.g. Qwen2.5-Omni, Qwen3-Omni)
  • Multi-modality generation models (e.g. Qwen-Image)

For more information, checkout the following: