Videos API
vLLM-Omni provides an OpenAI-compatible video generation API for diffusion video models. The API supports asynchronous video jobs through /v1/videos and a synchronous benchmark-oriented endpoint through /v1/videos/sync.
Each server instance runs a single model specified at startup with vllm serve <model> --omni.
Quick Start
Start the Server
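A minimal sketch of the launch step, based on the `vllm serve <model> --omni` form given above. The `start_server` wrapper is a hypothetical helper, and port 8091 is assumed only because the requests on this page target `localhost:8091`; substitute your own model name and port.

```bash
# Hypothetical wrapper around the launch command described on this page.
# $1: model name or path (placeholder, choose a supported video model)
# $2: port to listen on (assumed 8091 to match the examples on this page)
start_server() {
  local model="$1" port="${2:-8091}"
  vllm serve "${model}" --omni --port "${port}"
}

# Usage (blocks while the server runs):
#   start_server "<model>"
```

Each server instance serves one model, so running a second model means launching a second process on a different port.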
Create a Video Job
create_response=$(curl -s http://localhost:8091/v1/videos \
-F "prompt=A cinematic tracking shot of a mountain lake at sunrise" \
-F "width=1280" \
-F "height=720" \
-F "num_frames=80" \
-F "fps=16" \
-F "num_inference_steps=40")
video_id=$(echo "${create_response}" | jq -r '.id')
Poll and Download
curl -s "http://localhost:8091/v1/videos/${video_id}" | jq .
curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o output.mp4
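The two commands above check the job once. A script usually needs to wait until the job finishes before downloading, which can be sketched as a polling loop. This is an illustration, not the project's client: `wait_for_video` is a hypothetical helper, and it assumes the job record exposes a top-level `status` field that becomes `completed` on success (as described on this page) and `failed` on error (an assumption).

```bash
# Poll a video job until it finishes.
# Assumptions: the job record has a top-level "status" field; "failed"
# marks an unsuccessful job. Only "completed" is confirmed by this page.
# Returns 0 when completed, 1 when failed, 2 on timeout.
wait_for_video() {
  local base_url="$1" video_id="$2"
  local interval="${3:-5}"     # seconds between polls
  local max_polls="${4:-240}"  # give up after this many polls
  local polls=0 status
  while [ "${polls}" -lt "${max_polls}" ]; do
    status=$(curl -s "${base_url}/v1/videos/${video_id}" | jq -r '.status')
    case "${status}" in
      completed) return 0 ;;
      failed)
        echo "video job ${video_id} failed" >&2
        return 1
        ;;
    esac
    polls=$((polls + 1))
    sleep "${interval}"
  done
  echo "timed out waiting for video job ${video_id}" >&2
  return 2
}

# Usage:
#   wait_for_video http://localhost:8091 "${video_id}" &&
#     curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o output.mp4
```

The poll interval and limit are arbitrary defaults; video generation can take minutes, so tune them to the model and resolution in use.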
API Reference
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/videos | POST | Create an asynchronous video generation job |
| /v1/videos/sync | POST | Generate a video synchronously and return raw video bytes |
| /v1/videos/{video_id} | GET | Retrieve job status and metadata |
| /v1/videos | GET | List stored video jobs |
| /v1/videos/{video_id}/content | GET | Download generated video content |
| /v1/videos/{video_id} | DELETE | Delete a video job and stored output |
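The list, retrieve, and delete endpoints have no example elsewhere on this page, so here is a small sketch of how they might be called with curl. The function names and the `BASE_URL` variable are hypothetical conveniences; the paths and methods come from the table above. The response bodies are not documented here, so the helpers print them unparsed.

```bash
# Base URL of the running server (assumed to match the examples on this page).
BASE_URL="${BASE_URL:-http://localhost:8091}"

# GET /v1/videos: list stored video jobs.
list_videos() {
  curl -s "${BASE_URL}/v1/videos"
}

# GET /v1/videos/{video_id}: retrieve job status and metadata.
get_video() {
  curl -s "${BASE_URL}/v1/videos/$1"
}

# DELETE /v1/videos/{video_id}: delete a job and its stored output.
delete_video() {
  curl -s -X DELETE "${BASE_URL}/v1/videos/$1"
}

# Usage:
#   list_videos | jq .
#   delete_video "${video_id}"
```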
Request Parameters
POST /v1/videos and POST /v1/videos/sync accept multipart/form-data.
OpenAI-style fields
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Text prompt for video generation |
| model | string | server's model | Optional model name |
| seconds | string | null | Requested clip duration in seconds |
| size | string | null | Requested output size in WIDTHxHEIGHT format |
| user | string | null | Optional user identifier |
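For orientation, the arithmetic linking a clip duration to a frame count is duration = frames / fps, so the quick-start values of 80 frames at 16 fps correspond to a 5-second clip. The helpers below are hypothetical and only illustrate that relationship and the WIDTHxHEIGHT format; this page does not state how the server itself converts `seconds` or `size` into frame counts and dimensions, or how they interact with `num_frames`, `width`, and `height`.

```bash
# Frames needed for a clip of the given length at a fixed frame rate.
# Plain arithmetic (frames = seconds * fps), not the server's own mapping.
frames_for_duration() {
  local seconds="$1" fps="$2"
  echo $((seconds * fps))
}

# Format a value for the size field as WIDTHxHEIGHT.
size_field() {
  local width="$1" height="$2"
  echo "${width}x${height}"
}

# Usage:
#   frames_for_duration 5 16   # prints 80
#   size_field 1280 720        # prints 1280x720
#   curl -s http://localhost:8091/v1/videos \
#     -F "prompt=A mountain lake at sunrise" \
#     -F "seconds=5" \
#     -F "size=$(size_field 1280 720)"
```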
vLLM-Omni extension fields
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_reference | file | null | Uploaded reference image or video for image-to-video/video-to-video requests |
| image_reference | string | null | JSON-encoded reference image payload; do not combine with input_reference or video_reference |
| video_reference | string | null | JSON-encoded reference video payload; do not combine with input_reference or image_reference |
| audio_reference | string | null | JSON-encoded audio reference for speech-to-video: {"audio_url": "..."}; supports HTTP(s) URLs or base64 data URLs |
| width | integer | model default | Output video width |
| height | integer | model default | Output video height |
| num_frames | integer | model default | Number of generated frames |
| fps | integer | model default | Output frames per second |
| num_inference_steps | integer | model default | Number of diffusion steps |
| guidance_scale | number | null | CFG guidance scale for the low-noise stage |
| guidance_scale_2 | number | null | CFG guidance scale for the high-noise stage |
| boundary_ratio | number | null | Boundary split ratio for multi-stage denoising |
| flow_shift | number | null | Scheduler flow-shift value |
| true_cfg_scale | number | null | True CFG scale when supported by the model |
| seed | integer | null | Random seed for reproducibility |
| generate_sound | boolean | false | Request model-generated audio for video models that support sound generation |
| sound_duration | number | null | Duration in seconds for generated audio; defaults to generated video duration |
| negative_prompt | string | null | Text describing what to avoid in the generated video |
| enable_frame_interpolation | boolean | null | Enable post-generation frame interpolation |
| frame_interpolation_exp | integer | null | Interpolation exponent; 1=2x, 2=4x, and so on |
| frame_interpolation_scale | number | null | RIFE inference scale |
| frame_interpolation_model_path | string | null | Local path or Hugging Face repo for the interpolation model |
| lora | string | null | JSON-encoded LoRA configuration object |
| extra_params | string | null | JSON-encoded object for additional model-specific parameters |
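The frame_interpolation_exp row gives the pattern 1=2x, 2=4x, which reads as a multiplier of 2 raised to the exponent. The helper below is a hypothetical illustration of that pattern only. The exact output frame count and frame rate after interpolation are not specified on this page, and sending the boolean as the form value `true` is an assumption about how the server parses multipart fields.

```bash
# Interpolation multiplier implied by the exponent: 2^exp
# (1 -> 2x, 2 -> 4x, 3 -> 8x).
interpolation_factor() {
  local exp="$1"
  echo $((1 << exp))
}

# Usage:
#   interpolation_factor 2   # prints 4
#   curl -s http://localhost:8091/v1/videos \
#     -F "prompt=A small robot walking through a neon city" \
#     -F "num_frames=80" \
#     -F "fps=16" \
#     -F "enable_frame_interpolation=true" \
#     -F "frame_interpolation_exp=1"
```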
Create Response
POST /v1/videos returns a job record. Its id field identifies the job in later status, content, and delete requests.
The final content is available from /v1/videos/{video_id}/content after the job status becomes completed.
Synchronous Response
POST /v1/videos/sync blocks until generation finishes and returns raw video bytes. It is useful for benchmarks and simple scripts that do not need job storage or polling.
Examples
Image-to-Video
curl -s http://localhost:8091/v1/videos \
-F "prompt=animate this image with subtle camera movement" \
-F "input_reference=@/path/to/image.png" \
-F "width=1280" \
-F "height=720" \
-F "num_frames=80" \
-F "fps=16"
Video-to-Video
For models that support video conditioning, upload the reference video with input_reference:
curl -s http://localhost:8091/v1/videos \
-F "prompt=continue this motion with consistent subjects and lighting" \
-F "input_reference=@/path/to/video.mp4;type=video/mp4" \
-F "width=1280" \
-F "height=720" \
-F "num_frames=80" \
-F "fps=16"
You can also pass a JSON-safe video URL or data:video/...;base64,... payload through video_reference. Do not send video_reference together with input_reference or image_reference.
curl -s http://localhost:8091/v1/videos \
-F "prompt=continue this motion with consistent subjects and lighting" \
-F 'video_reference={"video_url":"https://example.com/input.mp4"}' \
-F "width=1280" \
-F "height=720" \
-F "num_frames=80" \
-F "fps=16"
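When the reference video is a local file rather than a URL, the data:video/...;base64,... form mentioned above can be built on the client. The following is a sketch under stated assumptions: `video_reference_json` is a hypothetical helper, and the default MIME type `video/mp4` is a placeholder that should match the actual file.

```bash
# Build a video_reference JSON payload that embeds a local file as a
# base64 data URL. Base64 output contains only JSON-safe characters,
# so no extra escaping is applied.
# $1: path to the video file
# $2: MIME type (assumed video/mp4 when omitted)
video_reference_json() {
  local path="$1" mime="${2:-video/mp4}"
  # Some base64 implementations wrap lines; strip newlines so the
  # payload stays a single JSON string.
  printf '{"video_url":"data:%s;base64,%s"}' \
    "${mime}" "$(base64 < "${path}" | tr -d '\n')"
}

# Usage (write the payload to a file, then let curl read the field
# value from it with the "<" form of -F):
#   video_reference_json input.mp4 > video_reference.json
#   curl -s http://localhost:8091/v1/videos \
#     -F "prompt=continue this motion with consistent subjects and lighting" \
#     -F "video_reference=<video_reference.json"
```

Reading the field from a file avoids command-line length limits, which an inline base64 payload can easily exceed. For large clips, uploading with input_reference is likely simpler than embedding the file.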
JSON references currently support image_url/video_url; file_id references are not implemented yet.
Models may expose additional V2V controls through extra_params. For example, Cosmos3 supports condition_frame_indexes_vision and condition_video_keep to select which decoded reference frames are used as clean conditioning. Cosmos3 transfer mode also accepts edge, blur, depth, seg, or wsm control hints plus transfer options such as control_path, control_guidance, control_guidance_interval, num_video_frames_per_chunk, num_conditional_frames, show_control_condition, and show_input; see the Cosmos3 recipe for complete examples.
Speech-to-Video
For models that support audio-driven generation (e.g., Wan2.2-S2V), pass both an image reference and an audio reference. The audio_reference field accepts a JSON string with audio_url pointing to an HTTP(s) URL or base64 data URL.
curl -s http://localhost:8091/v1/videos \
-F "prompt=A person singing" \
-F 'image_reference={"image_url": "https://example.com/face.png"}' \
-F 'audio_reference={"audio_url": "https://example.com/speech.mp3"}' \
-F "width=832" \
-F "height=480" \
-F "num_inference_steps=40" \
-F "guidance_scale=4.5" \
-F "fps=16"
Synchronous Generation
curl -X POST http://localhost:8091/v1/videos/sync \
-F "prompt=A small robot walking through a neon city" \
-F "width=854" \
-F "height=480" \
-F "num_frames=80" \
-F "fps=16" \
-o output.mp4
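Because the synchronous endpoint blocks until generation finishes and writes raw bytes to the output file, a script may want to fail cleanly on HTTP errors instead of saving an error body as output.mp4. A minimal sketch follows; `generate_sync` is a hypothetical helper and the 1800-second timeout is an arbitrary placeholder.

```bash
# Call the synchronous endpoint and keep the output only on success.
# $1: base URL, $2: prompt, $3: output path, $4: timeout in seconds
generate_sync() {
  local base_url="$1" prompt="$2" out="$3" timeout="${4:-1800}"
  # --fail makes curl exit non-zero on HTTP errors; --form-string sends
  # the prompt literally, so a leading "@" or "<" or an embedded ";" in
  # the prompt is not interpreted by curl.
  if ! curl -sS --fail --max-time "${timeout}" \
      -X POST "${base_url}/v1/videos/sync" \
      --form-string "prompt=${prompt}" \
      -o "${out}"; then
    rm -f "${out}"
    return 1
  fi
  # Treat an empty response body as a failure too.
  [ -s "${out}" ]
}

# Usage:
#   generate_sync http://localhost:8091 \
#     "A small robot walking through a neon city" output.mp4
```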
Storage
Set VLLM_OMNI_SERVER_STORAGE__PATH to control where asynchronous video outputs are stored.
VLLM_OMNI_STORAGE_PATH is deprecated and will be removed in a future release; use VLLM_OMNI_SERVER_STORAGE__PATH instead.
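As a sketch, the variable can be exported in the shell that launches the server. The directory below is a hypothetical location, and creating it up front is a precaution, since this page does not say whether the server creates it.

```bash
# Hypothetical storage directory; choose any writable, durable path.
export VLLM_OMNI_SERVER_STORAGE__PATH="${HOME:-/tmp}/vllm-omni-videos"
mkdir -p "${VLLM_OMNI_SERVER_STORAGE__PATH}"

# Migration aid: warn if the deprecated variable is still set.
if [ -n "${VLLM_OMNI_STORAGE_PATH:-}" ]; then
  echo "VLLM_OMNI_STORAGE_PATH is deprecated; use VLLM_OMNI_SERVER_STORAGE__PATH" >&2
fi

# Then start the server from this same shell, for example:
#   vllm serve "<model>" --omni
```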
Model-Specific Examples
For complete text-to-video, image-to-video, and model-specific video-to-video walkthroughs, see:
- Text-to-Video
- Image-to-Video
- Speech-to-Video for Wan2.2-S2V audio-driven lip-sync generation
- Cosmos3 recipes for model-specific video-to-video examples and conditioning controls