Text-To-Video¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_video.
This example demonstrates how to deploy text-to-video models for online video generation using vLLM-Omni.
Supported Models¶
| Model | Model ID |
|---|---|
| Wan2.1 T2V (1.3B) | Wan-AI/Wan2.1-T2V-1.3B-Diffusers |
| Wan2.1 T2V (14B) | Wan-AI/Wan2.1-T2V-14B-Diffusers |
| Wan2.2 T2V | Wan-AI/Wan2.2-T2V-A14B-Diffusers |
| LTX-2 | Lightricks/LTX-2 |
Wan2.2 T2V¶
Start Server¶
Basic Start¶
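The command for this step did not survive on this page, so the following is a minimal sketch rather than the original snippet. It assumes the same `vllm serve ... --omni` invocation used by run_server.sh and port 8091 to match the curl examples in this section; the function name `serve_wan22_basic` is hypothetical.

```bash
# Minimal launch with model defaults. The port is an assumption chosen to
# match the curl examples in this section.
serve_wan22_basic() {
  vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091
}
```

Calling `serve_wan22_basic` starts the server in the foreground.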
Start with Parameters¶
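As a sketch under the same assumptions, the launch can pass the two Wan2.2 scheduler flags that run_server.sh forwards. The defaults below mirror that script (0.875 and 5.0); the function name `serve_wan22` and the choice of port 8091 are illustrative.

```bash
# Launch with explicit scheduler parameters.
# $1 = boundary ratio, $2 = flow shift (defaults mirror run_server.sh).
serve_wan22() {
  local boundary_ratio="${1:-0.875}"
  local flow_shift="${2:-5.0}"
  vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni \
    --port 8091 \
    --boundary-ratio "$boundary_ratio" \
    --flow-shift "$flow_shift"
}
```

For example, `serve_wan22 0.9 3.0` would start the server with a different boundary ratio and flow shift.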
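Or use the startup script run_server.sh (listed under Example materials), for example by running bash run_server.sh from the example directory.

The script allows overriding the following environment variables:

- MODEL (default: Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- PORT (default: 8098 in the script listed under Example materials; the curl examples in this section use 8091, so set PORT=8091 or adjust the URLs)
- BOUNDARY_RATIO (default: 0.875)
- FLOW_SHIFT (default: 5.0)
- CACHE_BACKEND (default: none)
- ENABLE_CACHE_DIT_SUMMARY (default: 0)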
Async Job Behavior¶
POST /v1/videos is asynchronous. It creates a video job and immediately returns metadata like the job ID and initial queued status. To get the final artifact, poll the job status and then download the completed file from the content endpoint.
The main endpoints are:

- POST /v1/videos: create a video generation job (async)
- POST /v1/videos/sync: generate a video and return raw bytes (sync, for benchmarks)
- GET /v1/videos/{video_id}: retrieve the current job status and metadata
- GET /v1/videos: list stored video jobs
- GET /v1/videos/{video_id}/content: download the generated video file
- DELETE /v1/videos/{video_id}: delete the job and any stored output
Sync API (Benchmark / Testing)¶
POST /v1/videos/sync is a synchronous alternative that blocks until generation completes and returns the raw video bytes (video/mp4) directly in the response body. It is designed for benchmark and testing scenarios where one-shot request/response latency measurement is needed.
The sync endpoint accepts the same form parameters as POST /v1/videos. It does not create any stored job record — the response is purely the generated video file. Metadata is returned via response headers:
- X-Request-Id: unique identifier for this generation request
- X-Model: model name used for generation
- X-Inference-Time-S: wall-clock inference time in seconds
curl -X POST http://localhost:8091/v1/videos/sync \
-F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
-F "size=832x480" \
-F "num_frames=33" \
-F "fps=16" \
-F "num_inference_steps=40" \
-F "guidance_scale=4.0" \
-F "guidance_scale_2=4.0" \
-F "boundary_ratio=0.875" \
-F "flow_shift=5.0" \
-F "seed=42" \
-o sync_t2v_output.mp4
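Because the sync endpoint reports its metadata in response headers, a benchmark script needs to capture them alongside the video. The sketch below assumes the headers are written to a file with curl's `-D` option; the helper name `header_value` and the file names are hypothetical.

```bash
# Print the value of one response header from a file written by `curl -D`.
# $1 = header name (matched case-insensitively), $2 = headers file.
header_value() {
  local name_lc
  name_lc="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  awk -v name="$name_lc" '
    {
      line = $0
      sub(/\r$/, "", line)              # HTTP header lines end in CRLF
      idx = index(line, ":")
      if (idx == 0) next                # status line or blank line
      if (tolower(substr(line, 1, idx - 1)) == name) {
        val = substr(line, idx + 1)
        sub(/^[ \t]+/, "", val)
        print val
        exit
      }
    }' "$2"
}

# Usage against a running server (URL and file names are assumptions):
#   curl -sS -X POST http://localhost:8091/v1/videos/sync \
#     -F "prompt=A cinematic view of a futuristic city at sunset" \
#     -D sync_headers.txt -o sync_t2v_output.mp4
#   header_value X-Inference-Time-S sync_headers.txt
```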
Storage¶
Generated video files are stored on local disk by the async video API. Local file storage behavior can be controlled via the following environment variables:
- VLLM_OMNI_STORAGE_PATH: directory used for generated files (default: /tmp/storage)
- VLLM_OMNI_STORAGE_MAX_CONCURRENCY: max concurrent save/delete operations (default: 4)
Example:
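A minimal sketch of setting both variables before launching the server. The directory and the concurrency value are illustrative assumptions, and creating the directory up front is a precaution rather than a documented requirement.

```bash
# Illustrative values; choose a path with enough free disk space.
export VLLM_OMNI_STORAGE_PATH="${TMPDIR:-/tmp}/vllm_omni_videos"
export VLLM_OMNI_STORAGE_MAX_CONCURRENCY=8

# Create the directory ahead of time (harmless if it already exists).
mkdir -p "$VLLM_OMNI_STORAGE_PATH"

# Start the server from this same shell so it inherits both variables, e.g.:
#   bash run_server.sh
echo "Storing generated videos in $VLLM_OMNI_STORAGE_PATH"
```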
API Calls¶
Method 1: Using curl¶
# Basic text-to-video generation
bash run_curl_text_to_video.sh
# Or execute directly (OpenAI-style multipart)
create_response=$(curl -s http://localhost:8091/v1/videos \
-H "Accept: application/json" \
-F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
-F "width=832" \
-F "height=480" \
-F "num_frames=33" \
-F "negative_prompt=色调艳丽 ,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
-F "fps=16" \
-F "num_inference_steps=40" \
-F "guidance_scale=4.0" \
-F "guidance_scale_2=4.0" \
-F "boundary_ratio=0.875" \
-F "flow_shift=5.0" \
-F "seed=42")
video_id=$(echo "$create_response" | jq -r '.id')
while true; do
status=$(curl -s "http://localhost:8091/v1/videos/${video_id}" | jq -r '.status')
if [ "$status" = "completed" ]; then
break
fi
if [ "$status" = "failed" ]; then
echo "Video generation failed"
exit 1
fi
sleep 2
done
curl -s "http://localhost:8091/v1/videos/${video_id}" | jq .
curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o wan22_output.mp4
Request Format¶
Simple Text-to-Video Generation¶
curl -X POST http://localhost:8091/v1/videos \
-F "prompt=A cinematic view of a futuristic city at sunset"
Generation with Parameters¶
curl -X POST http://localhost:8091/v1/videos \
-F "prompt=A cinematic view of a futuristic city at sunset" \
-F "width=832" \
-F "height=480" \
-F "num_frames=33" \
-F "negative_prompt=low quality, blurry, static" \
-F "fps=16" \
-F "num_inference_steps=40" \
-F "guidance_scale=4.0" \
-F "guidance_scale_2=4.0" \
-F "boundary_ratio=0.875" \
-F "flow_shift=5.0" \
-F "seed=42"
Generation Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | - | Text description of the desired video |
| seconds | str | None | Clip duration in seconds |
| size | str | None | Output size in WIDTHxHEIGHT format |
| negative_prompt | str | None | Negative prompt |
| width | int | None | Video width in pixels |
| height | int | None | Video height in pixels |
| num_frames | int | None | Number of frames to generate |
| fps | int | None | Frames per second for output video |
| num_inference_steps | int | None | Number of denoising steps |
| guidance_scale | float | None | CFG guidance scale (low-noise stage) |
| guidance_scale_2 | float | None | CFG guidance scale (high-noise stage, Wan2.2) |
| boundary_ratio | float | None | Boundary split ratio for low/high DiT (Wan2.2) |
| flow_shift | float | None | Scheduler flow shift (Wan2.2) |
| seed | int | None | Random seed (reproducible) |
| lora | object | None | LoRA configuration |
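The table offers two ways to express resolution (size, or width and height) and two ways to express length (seconds, or num_frames with fps). How the server reconciles them when both are given is not stated here, so the sketch below only shows the client-side arithmetic: splitting a WIDTHxHEIGHT string and computing the playback duration implied by a frame count. The helper names are hypothetical.

```bash
# Split "WIDTHxHEIGHT" into "WIDTH HEIGHT".
parse_size() {
  echo "${1%%x*} ${1#*x}"
}

# Playback duration in seconds for a given frame count and frame rate.
clip_duration() {
  awk -v frames="$1" -v fps="$2" 'BEGIN { printf "%.2f\n", frames / fps }'
}

parse_size 832x480     # prints "832 480"
clip_duration 33 16    # prints "2.06" (the Wan2.2 example settings)
clip_duration 41 24    # prints "1.71" (the LTX-2 example settings)
```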
Create Response Format¶
POST /v1/videos returns a job record, not inline base64 video data.
{
"id": "video_gen_123",
"object": "video",
"status": "queued",
"model": "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
"prompt": "A cinematic view of a futuristic city at sunset",
"created_at": 1234567890
}
Retrieve, List, Download, and Delete¶
Retrieve a job¶
List jobs¶
Download the completed video¶
Delete a job and its stored file¶
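The four operations named in the headings above map onto the endpoints listed under Async Job Behavior. As a sketch, they can be wrapped in small shell functions; the function names and the default base URL are assumptions, and each function returns the raw response so it can be piped to jq.

```bash
BASE_URL="${BASE_URL:-http://localhost:8091}"   # assumed server address

# GET /v1/videos/{video_id}: current status and metadata for one job.
retrieve_video() { curl -sS "${BASE_URL}/v1/videos/$1"; }

# GET /v1/videos: all stored jobs.
list_videos() { curl -sS "${BASE_URL}/v1/videos"; }

# GET /v1/videos/{video_id}/content: save the finished video to a file ($2).
download_video() { curl -sS -L "${BASE_URL}/v1/videos/$1/content" -o "$2"; }

# DELETE /v1/videos/{video_id}: remove the job and any stored output.
delete_video() { curl -sS -X DELETE "${BASE_URL}/v1/videos/$1"; }
```

For example, `retrieve_video "$video_id" | jq .` prints the job record, and `download_video "$video_id" wan22_output.mp4` saves the result once the status is completed.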
Poll Until Complete¶
while true; do
status=$(curl -s "http://localhost:8091/v1/videos/${video_id}" | jq -r '.status')
if [ "$status" = "completed" ]; then
break
fi
if [ "$status" = "failed" ]; then
echo "Video generation failed"
exit 1
fi
sleep 2
done
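The loop above runs until the job reaches completed or failed, so it never ends if the job stays queued. One possible variation, sketched here, adds a deadline. The 600-second default, the return codes, and the function names are assumptions made for illustration, not part of the API.

```bash
BASE_URL="${BASE_URL:-http://localhost:8091}"   # assumed server address

# Fetch the status string for one job (requires curl and jq).
fetch_status() {
  curl -sS "${BASE_URL}/v1/videos/$1" | jq -r '.status'
}

# Poll until the job completes, fails, or the deadline passes.
# $1 = video id, $2 = timeout in seconds, $3 = poll interval in seconds.
# Returns 0 on completed, 1 on failed, 2 on timeout.
wait_for_video() {
  local video_id="$1"
  local timeout_s="${2:-600}"
  local interval_s="${3:-2}"
  local waited=0
  local status
  while [ "$waited" -lt "$timeout_s" ]; do
    status="$(fetch_status "$video_id")"
    case "$status" in
      completed) return 0 ;;
      failed) echo "Video generation failed" >&2; return 1 ;;
    esac
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "Timed out after ${timeout_s}s waiting for ${video_id}" >&2
  return 2
}
```

A caller can then write `wait_for_video "$video_id" 300 && curl -L "${BASE_URL}/v1/videos/${video_id}/content" -o output.mp4` to download only after a successful run.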
LTX-2¶
Start Server¶
Basic Start¶
vllm serve Lightricks/LTX-2 --omni --port 8098 \
--enforce-eager --flow-shift 1.0 --boundary-ratio 1.0
Start with Optimization Presets¶
Use the LTX-2 startup script with built-in optimization presets:
# Baseline (1 GPU, eager)
bash run_server_ltx2.sh baseline
# 4-GPU Ulysses sequence parallelism (lossless)
bash run_server_ltx2.sh ulysses4
# Cache-DiT lossy acceleration (1 GPU, ~1.4× speedup)
bash run_server_ltx2.sh cache-dit
# Best combo: 4-GPU Ulysses SP + Cache-DiT (~2.2× speedup)
bash run_server_ltx2.sh best-combo
Optimization Benchmarks¶
Benchmarked on H800 with online serving (480×768 as height×width, 41 frames, 20 steps, seed=42). "Inference" is the server-reported inference time and excludes HTTP and polling overhead.
| Preset | Server Command | Inference (s) | Speedup | Type |
|---|---|---|---|---|
| baseline | --enforce-eager | 10.3 | 1.00× | - |
| compile | (default, no --enforce-eager) | ~10.3 (warm) | ~1.00× | Lossless |
| ulysses4 | --enforce-eager --usp 4 | ~10.3 | ~1.00× | Lossless |
| cache-dit | --enforce-eager --cache-backend cache_dit | 7.4 avg | ~1.4× | Lossy |
| best-combo | --enforce-eager --usp 4 --cache-backend cache_dit | 4.7 avg | ~2.2× | Lossless + Lossy |
Observations:

- torch.compile: On H800, warm-request inference time matches the eager baseline (~10.3s). The first request pays ~6s of compilation overhead. The benefit depends on model architecture and GPU.
- Ulysses SP (4 GPU): No measurable speedup alone for 41-frame generation at this resolution. Communication overhead outweighs the gains at this sequence length.
- Cache-DiT: Inference time varies per request (6–10s) due to dynamic caching decisions. The average is ~7.4s (~1.4× speedup) with a slight quality tradeoff.
- Best combo: 4-GPU Ulysses SP and Cache-DiT combine well: Cache-DiT reduces per-step computation, which makes the communication overhead of Ulysses SP worthwhile. The average is ~4.7s (~2.2× speedup).
- FP8 quantization: Reduces VRAM but does not speed up LTX-2 on H800 (compute-bound).
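The speedup figures in the table are the baseline inference time divided by the preset's average inference time. A small sketch of that arithmetic, using the numbers reported above (the helper name `speedup` is hypothetical):

```bash
# Speedup = baseline seconds / optimized seconds, to two decimals.
speedup() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

speedup 10.3 7.4   # cache-dit:  prints 1.39, reported as ~1.4×
speedup 10.3 4.7   # best-combo: prints 2.19, reported as ~2.2×
```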
Deployment Recommendations:

- For production with quality priority: use baseline with --enforce-eager
- For maximum throughput (4 GPUs, quality tradeoff): use best-combo (~2.2× speedup)
- For single-GPU throughput: use cache-dit (~1.4× speedup)
- --enforce-eager is recommended to avoid torch.compile warmup latency on the first request
Send Requests (curl)¶
# Using the provided script
bash run_curl_ltx2.sh
# Or directly
curl -sS -X POST http://localhost:8098/v1/videos \
-H "Accept: application/json" \
-F "prompt=A serene lakeside sunrise with mist over the water." \
-F "width=768" \
-F "height=480" \
-F "num_frames=41" \
-F "fps=24" \
-F "num_inference_steps=20" \
-F "guidance_scale=3.0" \
-F "seed=42"
Example materials¶
run_curl_hunyuan_video_15.sh
#!/bin/bash
# HunyuanVideo-1.5 text-to-video curl example using the async video job API.
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:8098}"
OUTPUT_PATH="${OUTPUT_PATH:-hunyuan_video_15_t2v.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"
create_response=$(
curl -sS -X POST "${BASE_URL}/v1/videos" \
-H "Accept: application/json" \
-F "prompt=A little girl wearing a straw hat runs through a summer meadow full of wildflowers. A wide shot is used, with the camera panning right to follow her." \
-F "size=832x480" \
-F "num_frames=33" \
-F "fps=24" \
-F "num_inference_steps=30" \
-F "guidance_scale=6.0" \
-F "flow_shift=5.0" \
-F "seed=42"
)
video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
echo "Failed to create video job:"
echo "${create_response}" | jq .
exit 1
fi
echo "Created video job ${video_id}"
echo "${create_response}" | jq .
while true; do
status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
status="$(echo "${status_response}" | jq -r '.status')"
case "${status}" in
queued|in_progress)
echo "Video job ${video_id} status: ${status}"
sleep "${POLL_INTERVAL}"
;;
completed)
echo "${status_response}" | jq .
break
;;
failed)
echo "Video generation failed:"
echo "${status_response}" | jq .
exit 1
;;
*)
echo "Unexpected status response:"
echo "${status_response}" | jq .
exit 1
;;
esac
done
curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"
run_curl_ltx2.sh
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# LTX-2 text-to-video curl example using the async video job API.
# Start the server first: bash run_server_ltx2.sh best-combo
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:8098}"
OUTPUT_PATH="${OUTPUT_PATH:-ltx2_output.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"
PROMPT="${PROMPT:-A serene lakeside sunrise with mist over the water.}"
create_response=$(
curl -sS -X POST "${BASE_URL}/v1/videos" \
-H "Accept: application/json" \
-F "prompt=${PROMPT}" \
-F "width=768" \
-F "height=480" \
-F "num_frames=41" \
-F "fps=24" \
-F "num_inference_steps=20" \
-F "guidance_scale=3.0" \
-F "seed=42"
)
video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
echo "Failed to create video job:"
echo "${create_response}" | jq .
exit 1
fi
echo "Created video job ${video_id}"
echo "${create_response}" | jq .
while true; do
status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
status="$(echo "${status_response}" | jq -r '.status')"
case "${status}" in
queued|in_progress)
echo "Video job ${video_id} status: ${status}"
sleep "${POLL_INTERVAL}"
;;
completed)
echo "${status_response}" | jq .
break
;;
failed)
echo "Video generation failed:"
echo "${status_response}" | jq .
exit 1
;;
*)
echo "Unexpected status response:"
echo "${status_response}" | jq .
exit 1
;;
esac
done
curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"
run_curl_text_to_video.sh
#!/bin/bash
# Wan2.2 text-to-video curl example using the async video job API.
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:8098}"
OUTPUT_PATH="${OUTPUT_PATH:-wan22_output.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"
create_response=$(
curl -sS -X POST "${BASE_URL}/v1/videos" \
-H "Accept: application/json" \
-F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
-F "seconds=2" \
-F "size=832x480" \
-F "negative_prompt=色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
-F "fps=16" \
-F "num_inference_steps=40" \
-F "guidance_scale=4.0" \
-F "guidance_scale_2=4.0" \
-F "boundary_ratio=0.875" \
-F "flow_shift=5.0" \
-F "seed=42"
)
video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
echo "Failed to create video job:"
echo "${create_response}" | jq .
exit 1
fi
echo "Created video job ${video_id}"
echo "${create_response}" | jq .
while true; do
status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
status="$(echo "${status_response}" | jq -r '.status')"
case "${status}" in
queued|in_progress)
echo "Video job ${video_id} status: ${status}"
sleep "${POLL_INTERVAL}"
;;
completed)
echo "${status_response}" | jq .
break
;;
failed)
echo "Video generation failed:"
echo "${status_response}" | jq .
exit 1
;;
*)
echo "Unexpected status response:"
echo "${status_response}" | jq .
exit 1
;;
esac
done
curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"
run_server.sh
#!/bin/bash
# Wan2.2 online serving startup script
MODEL="${MODEL:-Wan-AI/Wan2.2-T2V-A14B-Diffusers}"
PORT="${PORT:-8098}"
BOUNDARY_RATIO="${BOUNDARY_RATIO:-0.875}"
FLOW_SHIFT="${FLOW_SHIFT:-5.0}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
ENABLE_CACHE_DIT_SUMMARY="${ENABLE_CACHE_DIT_SUMMARY:-0}"
echo "Starting Wan2.2 server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Boundary ratio: $BOUNDARY_RATIO"
echo "Flow shift: $FLOW_SHIFT"
echo "Cache backend: $CACHE_BACKEND"
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
echo "Cache-DiT summary: enabled"
fi
CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi
vllm serve "$MODEL" --omni \
--port "$PORT" \
--boundary-ratio "$BOUNDARY_RATIO" \
--flow-shift "$FLOW_SHIFT" \
$CACHE_BACKEND_FLAG \
$(if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then echo "--enable-cache-dit-summary"; fi)
run_server_hunyuan_video_15.sh
#!/bin/bash
# HunyuanVideo-1.5 text-to-video online serving startup script
#
# 480p: ~35 GB VRAM (BF16), fits 1x A100 80GB
# 720p: needs FP8 + VAE tiling, ~35 GB VRAM
MODEL="${MODEL:-hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v}"
PORT="${PORT:-8098}"
FLOW_SHIFT="${FLOW_SHIFT:-5.0}"
QUANTIZATION="${QUANTIZATION:-}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
echo "Starting HunyuanVideo-1.5 T2V server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Flow shift: $FLOW_SHIFT"
echo "Quantization: ${QUANTIZATION:-none}"
echo "Cache backend: $CACHE_BACKEND"
EXTRA_FLAGS=""
if [ -n "$QUANTIZATION" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS --quantization $QUANTIZATION"
fi
if [ "$CACHE_BACKEND" != "none" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS --cache-backend $CACHE_BACKEND"
fi
vllm serve "$MODEL" --omni \
--port "$PORT" \
--flow-shift "$FLOW_SHIFT" \
$EXTRA_FLAGS
run_server_ltx2.sh
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# LTX-2 online serving startup script with optimization presets.
#
# Usage:
# bash run_server_ltx2.sh # baseline (1 GPU, eager)
# bash run_server_ltx2.sh ulysses4 # 4-GPU Ulysses SP
# bash run_server_ltx2.sh cache-dit # 1 GPU + Cache-DiT
# bash run_server_ltx2.sh best-combo # 4-GPU Ulysses SP + Cache-DiT
#
# Online serving benchmarks on H800 (480×768, 41 frames, 20 steps):
# baseline : 10.3s inference (1.00×)
# compile : ~10.3s warm (~1.00×) first request +6s warmup
# ulysses4 : ~10.3s (~1.00×) no gain at 41 frames
# cache-dit : 7.4s avg (~1.4×) lossy, variable per request
# best-combo : 4.7s avg (~2.2×) 4-GPU ulysses + cache-dit
set -euo pipefail
MODEL="${MODEL:-Lightricks/LTX-2}"
PORT="${PORT:-8098}"
FLOW_SHIFT="${FLOW_SHIFT:-1.0}"
BOUNDARY_RATIO="${BOUNDARY_RATIO:-1.0}"
PRESET="${1:-baseline}"
EXTRA_ARGS=()
case "$PRESET" in
baseline)
echo "=== LTX-2 Preset: baseline (1 GPU, enforce-eager) ==="
EXTRA_ARGS+=(--enforce-eager)
;;
ulysses2)
echo "=== LTX-2 Preset: 2-GPU Ulysses SP (lossless) ==="
EXTRA_ARGS+=(--enforce-eager --usp 2)
;;
ulysses4)
echo "=== LTX-2 Preset: 4-GPU Ulysses SP (lossless) ==="
EXTRA_ARGS+=(--enforce-eager --usp 4)
;;
cache-dit)
echo "=== LTX-2 Preset: Cache-DiT (1 GPU, lossy) ==="
EXTRA_ARGS+=(--enforce-eager --cache-backend cache_dit)
;;
best-combo)
echo "=== LTX-2 Preset: 4-GPU Ulysses SP + Cache-DiT (best combo) ==="
EXTRA_ARGS+=(--enforce-eager --usp 4 --cache-backend cache_dit)
;;
compile)
echo "=== LTX-2 Preset: torch.compile (1 GPU, lossless) ==="
# torch.compile is the default (no --enforce-eager)
;;
*)
echo "Usage: $0 {baseline|ulysses2|ulysses4|cache-dit|best-combo|compile}"
echo ""
echo "Presets:"
echo " baseline - 1 GPU, eager execution (reference)"
echo " ulysses2 - 2-GPU Ulysses SP (lossless)"
echo " ulysses4 - 4-GPU Ulysses SP (lossless)"
echo " cache-dit - 1 GPU + Cache-DiT (lossy, ~1.4× speedup)"
echo " best-combo - 4-GPU Ulysses SP + Cache-DiT (~2.2× speedup)"
echo " compile - 1 GPU + torch.compile (slower first request)"
echo ""
echo "Environment variables:"
echo " MODEL - Model path (default: Lightricks/LTX-2)"
echo " PORT - Server port (default: 8098)"
echo " FLOW_SHIFT - Scheduler flow shift (default: 1.0)"
echo " BOUNDARY_RATIO - Boundary ratio (default: 1.0)"
exit 1
;;
esac
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Flow shift: $FLOW_SHIFT"
echo "Boundary ratio: $BOUNDARY_RATIO"
vllm serve "$MODEL" --omni \
--port "$PORT" \
--flow-shift "$FLOW_SHIFT" \
--boundary-ratio "$BOUNDARY_RATIO" \
"${EXTRA_ARGS[@]}"