Text-To-Video

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/text_to_video

This example demonstrates how to deploy text-to-video models for online video generation using vLLM-Omni.

Supported Models

Model	Model ID
Wan2.1 T2V (1.3B)	`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`
Wan2.1 T2V (14B)	`Wan-AI/Wan2.1-T2V-14B-Diffusers`
Wan2.2 T2V	`Wan-AI/Wan2.2-T2V-A14B-Diffusers`
LTX-2	`Lightricks/LTX-2`
Helios (Base / Mid / Distilled)	`BestWishYsh/Helios-Base`, `BestWishYsh/Helios-Mid`, `BestWishYsh/Helios-Distilled`

Wan2.2 T2V

Start Server

Basic Start

vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091

Start with Parameters

Or use the startup script:

bash run_server.sh

The script allows overriding:

- MODEL (default: Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- PORT (default: 8098 in the run_server.sh listing under Example materials; the curl examples in this section assume 8091, so launch with PORT=8091 to follow them as written)
- BOUNDARY_RATIO (default: 0.875)
- FLOW_SHIFT (default: 5.0)
- CACHE_BACKEND (default: none)
- ENABLE_CACHE_DIT_SUMMARY (default: 0)
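To preview how these overrides translate into server flags, the following dry-run sketch applies the same defaults as the run_server.sh listing under Example materials and prints the resulting command instead of executing it. The helper name print_serve_cmd is hypothetical and not part of the example directory.

```shell
# print_serve_cmd: hypothetical dry-run helper (not shipped with the examples).
# It mirrors the defaults and flag logic of the run_server.sh listing and
# prints the command line rather than starting a server.
print_serve_cmd() (
  model="${MODEL:-Wan-AI/Wan2.2-T2V-A14B-Diffusers}"
  port="${PORT:-8098}"
  boundary_ratio="${BOUNDARY_RATIO:-0.875}"
  flow_shift="${FLOW_SHIFT:-5.0}"
  cache_backend="${CACHE_BACKEND:-none}"
  cache_dit_summary="${ENABLE_CACHE_DIT_SUMMARY:-0}"

  # Build the argument list in the positional parameters.
  set -- vllm serve "$model" --omni --port "$port" \
    --boundary-ratio "$boundary_ratio" --flow-shift "$flow_shift"
  if [ "$cache_backend" != "none" ]; then
    set -- "$@" --cache-backend "$cache_backend"
  fi
  if [ "$cache_dit_summary" != "0" ]; then
    set -- "$@" --enable-cache-dit-summary
  fi
  echo "$*"
)
```

The same variables can be set inline when launching the real script, for example PORT=8091 CACHE_BACKEND=cache_dit bash run_server.sh.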

Async Job Behavior

POST /v1/videos is asynchronous. It creates a video job and immediately returns metadata like the job ID and initial queued status. To get the final artifact, poll the job status and then download the completed file from the content endpoint.

The main endpoints are:

- POST /v1/videos: create a video generation job (async)
- POST /v1/videos/sync: generate a video and return raw bytes (sync, for benchmarks)
- GET /v1/videos/{video_id}: retrieve the current job status and metadata
- GET /v1/videos: list stored video jobs
- GET /v1/videos/{video_id}/content: download the generated video file
- DELETE /v1/videos/{video_id}: delete the job and any stored output

Sync API (Benchmark / Testing)

POST /v1/videos/sync is a synchronous alternative that blocks until generation completes and returns the raw video bytes (video/mp4) directly in the response body. It is designed for benchmark and testing scenarios where one-shot request/response latency measurement is needed.

The sync endpoint accepts the same form parameters as POST /v1/videos. It does not create any stored job record — the response is purely the generated video file. Metadata is returned via response headers:

X-Request-Id: unique identifier for this generation request
X-Model: model name used for generation
X-Inference-Time-S: wall-clock inference time in seconds

curl -X POST http://localhost:8091/v1/videos/sync \
  -F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  -F "size=832x480" \
  -F "num_frames=33" \
  -F "fps=16" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.0" \
  -F "guidance_scale_2=4.0" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=5.0" \
  -F "seed=42" \
  -o sync_t2v_output.mp4
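The request above saves only the video body. To read the metadata headers as well, curl's -D (--dump-header) option can write the response headers to a file. The sketch below assumes such a dump; header_value and the file name sync_headers.txt are hypothetical names chosen for illustration.

```shell
# header_value FILE NAME
# Print the value of response header NAME from a header dump written by
# `curl -D FILE`. Header names are matched case-insensitively.
# header_value is a hypothetical helper, not part of vLLM-Omni.
header_value() {
  awk -v want="$2" '
    {
      sub(/\r$/, "")                 # header dumps keep CRLF line endings
      sep = index($0, ":")
      if (sep > 0 && tolower(substr($0, 1, sep - 1)) == tolower(want)) {
        value = substr($0, sep + 1)
        sub(/^[ \t]+/, "", value)    # drop the space after the colon
        print value
        exit
      }
    }' "$1"
}

# Example usage against a running server (commented out so that sourcing this
# snippet has no side effects):
#   curl -sS -X POST http://localhost:8091/v1/videos/sync \
#     -F "prompt=A cinematic view of a futuristic city at sunset" \
#     -D sync_headers.txt -o sync_t2v_output.mp4
#   header_value sync_headers.txt X-Inference-Time-S
```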

Storage

Generated video files are stored on local disk by the async video API. Local file storage behavior can be controlled via the following environment variables:

VLLM_OMNI_SERVER_STORAGE__PATH: directory used for generated files (default: /tmp/storage)
VLLM_OMNI_SERVER_STORAGE__FILE_CONCURRENCY: max concurrent save/delete operations (default: 4)

VLLM_OMNI_STORAGE_PATH and VLLM_OMNI_STORAGE_MAX_CONCURRENCY are deprecated and will be removed in a future release; use the names above instead.

Example:

export VLLM_OMNI_SERVER_STORAGE__PATH=/var/tmp/vllm-omni-videos
export VLLM_OMNI_SERVER_STORAGE__FILE_CONCURRENCY=8
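For environments that still export the deprecated names, a small shim in the launch script can carry the values over before the server starts. This is a sketch only: migrate_storage_env is a hypothetical helper, and it assumes that a value set under a deprecated name can be reused unchanged under the new name.

```shell
# migrate_storage_env: hypothetical helper, not part of vLLM-Omni.
# Assumption: deprecated and current variables accept the same values.
# A value already set under the new name always wins.
migrate_storage_env() {
  if [ -z "${VLLM_OMNI_SERVER_STORAGE__PATH:-}" ] && [ -n "${VLLM_OMNI_STORAGE_PATH:-}" ]; then
    export VLLM_OMNI_SERVER_STORAGE__PATH="$VLLM_OMNI_STORAGE_PATH"
  fi
  if [ -z "${VLLM_OMNI_SERVER_STORAGE__FILE_CONCURRENCY:-}" ] && [ -n "${VLLM_OMNI_STORAGE_MAX_CONCURRENCY:-}" ]; then
    export VLLM_OMNI_SERVER_STORAGE__FILE_CONCURRENCY="$VLLM_OMNI_STORAGE_MAX_CONCURRENCY"
  fi
  # Drop the deprecated names so only one spelling reaches the server.
  unset VLLM_OMNI_STORAGE_PATH VLLM_OMNI_STORAGE_MAX_CONCURRENCY
}
```

Calling migrate_storage_env once before vllm serve is enough; it does nothing when only the new names are set.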

API Calls

Method 1: Using curl

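# Basic text-to-video generation.
# The script defaults to BASE_URL=http://localhost:8098; override it when the
# server was started on port 8091 as shown above.
BASE_URL=http://localhost:8091 bash run_curl_text_to_video.sh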

# Or execute directly (OpenAI-style multipart)
create_response=$(curl -s http://localhost:8091/v1/videos \
  -H "Accept: application/json" \
  -F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  -F "width=832" \
  -F "height=480" \
  -F "num_frames=33" \
  -F "negative_prompt=色调艳丽 ，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" \
  -F "fps=16" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.0" \
  -F "guidance_scale_2=4.0" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=5.0" \
  -F "seed=42")

video_id=$(echo "$create_response" | jq -r '.id')
while true; do
  status=$(curl -s "http://localhost:8091/v1/videos/${video_id}" | jq -r '.status')
  if [ "$status" = "completed" ]; then
    break
  fi
  if [ "$status" = "failed" ]; then
    echo "Video generation failed"
    exit 1
  fi
  sleep 2
done

curl -s "http://localhost:8091/v1/videos/${video_id}" | jq .
curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o wan22_output.mp4

Request Format

Simple Text-to-Video Generation

curl -X POST http://localhost:8091/v1/videos \
  -F "prompt=A cinematic view of a futuristic city at sunset"

Generation with Parameters

curl -X POST http://localhost:8091/v1/videos \
  -F "prompt=A cinematic view of a futuristic city at sunset" \
  -F "width=832" \
  -F "height=480" \
  -F "num_frames=33" \
  -F "negative_prompt=low quality, blurry, static" \
  -F "fps=16" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.0" \
  -F "guidance_scale_2=4.0" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=5.0" \
  -F "seed=42"

Generation Parameters

Parameter	Type	Default	Description
`prompt`	str	-	Text description of the desired video
`seconds`	str	None	Clip duration in seconds
`size`	str	None	Output size in `WIDTHxHEIGHT` format
`negative_prompt`	str	None	Negative prompt
`width`	int	None	Video width in pixels
`height`	int	None	Video height in pixels
`num_frames`	int	None	Number of frames to generate
`fps`	int	None	Frames per second for output video
`num_inference_steps`	int	None	Number of denoising steps
`guidance_scale`	float	None	CFG guidance scale (low-noise stage)
`guidance_scale_2`	float	None	CFG guidance scale (high-noise stage, Wan2.2)
`boundary_ratio`	float	None	Boundary split ratio for low/high DiT (Wan2.2)
`flow_shift`	float	None	Scheduler flow shift (Wan2.2)
`seed`	int	None	Random seed (reproducible)
`lora`	object	None	LoRA configuration
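When choosing num_frames and fps together, it helps to check the playback length they imply (frames divided by frames per second) and to split a `size` string into its two dimensions. The helpers below are a hypothetical sketch for such client-side sanity checks; they do not describe how the server itself resolves `seconds`, `size`, or `num_frames`.

```shell
# Hypothetical client-side helpers; none of these names exist in vLLM-Omni.

# clip_seconds NUM_FRAMES FPS
# Playback duration in seconds (frames / fps), printed with two decimals.
clip_seconds() {
  awk -v frames="$1" -v fps="$2" 'BEGIN { printf "%.2f\n", frames / fps }'
}

# size_to_width SIZE and size_to_height SIZE
# Split a WIDTHxHEIGHT string such as 832x480 into its parts.
size_to_width()  { echo "${1%%x*}"; }
size_to_height() { echo "${1#*x}"; }
```

For instance, the Wan2.2 examples on this page request 33 frames at 16 fps, which plays for a little over two seconds, and `size_to_width 832x480` yields the value sent as `width` in the multipart examples.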

Create Response Format

POST /v1/videos returns a job record, not inline base64 video data.

{
  "id": "video_gen_123",
  "object": "video",
  "status": "queued",
  "model": "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
  "prompt": "A cinematic view of a futuristic city at sunset",
  "created_at": 1234567890
}

Retrieve, List, Download, and Delete

Retrieve a job

curl -s http://localhost:8091/v1/videos/${video_id} | jq .

List jobs

curl -s http://localhost:8091/v1/videos | jq .

Download the completed video

curl -L http://localhost:8091/v1/videos/${video_id}/content -o wan22_output.mp4

Delete a job and its stored file

curl -s -X DELETE "http://localhost:8091/v1/videos/${video_id}" | jq .

Poll Until Complete

while true; do
  status=$(curl -s http://localhost:8091/v1/videos/${video_id} | jq -r '.status')
  if [ "$status" = "completed" ]; then
    break
  fi
  if [ "$status" = "failed" ]; then
    echo "Video generation failed"
    exit 1
  fi
  sleep 2
done
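The loop above runs until the job reaches a terminal status, so it never returns if the job stays queued or the status field is missing. A bounded variant can be sketched as follows. The function names, the exit codes, and the default timeout are hypothetical choices for illustration; only the endpoint and the completed/failed status values come from this page.

```shell
BASE_URL="${BASE_URL:-http://localhost:8091}"

# fetch_status VIDEO_ID: print the job's current status string.
fetch_status() {
  curl -sS "${BASE_URL}/v1/videos/$1" | jq -r '.status'
}

# wait_for_video VIDEO_ID [TIMEOUT_S] [INTERVAL_S]
# Returns 0 when the job completed, 1 when it failed, and 2 when the timeout
# elapsed first. Only time spent sleeping is counted, so the real wall-clock
# limit is slightly longer than TIMEOUT_S.
wait_for_video() (
  video_id="$1"
  timeout_s="${2:-600}"
  interval_s="${3:-2}"
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    status="$(fetch_status "$video_id")"
    case "$status" in
      completed) exit 0 ;;
      failed) echo "Video generation failed" >&2; exit 1 ;;
    esac
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "Timed out after ${timeout_s}s waiting for ${video_id}" >&2
  exit 2
)
```

A caller can then write `wait_for_video "$video_id" 900 && curl -L "${BASE_URL}/v1/videos/${video_id}/content" -o wan22_output.mp4` to download only after a successful run.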

LTX-2

Start Server

Basic Start

vllm serve Lightricks/LTX-2 --omni --port 8098 \
    --enforce-eager --flow-shift 1.0 --boundary-ratio 1.0

Start with Optimization Presets

Use the LTX-2 startup script with built-in optimization presets:

# Baseline (1 GPU, eager)
bash run_server_ltx2.sh baseline

# 4-GPU Ulysses sequence parallelism (lossless)
bash run_server_ltx2.sh ulysses4

# Cache-DiT lossy acceleration (1 GPU, ~1.4× speedup)
bash run_server_ltx2.sh cache-dit

# Best combo: 4-GPU Ulysses SP + Cache-DiT (~2.2× speedup)
bash run_server_ltx2.sh best-combo

Optimization Benchmarks

Benchmarked on H800, online serving (480×768 as height × width, matching the width=768 and height=480 request fields; 41 frames, 20 steps, seed=42). "Inference" is the server-reported inference time and excludes HTTP and polling overhead.

Preset	Server Command	Inference (s)	Speedup	Type
`baseline`	`--enforce-eager`	10.3	1.00×	—
`compile`	(default, no --enforce-eager)	~10.3 (warm)	~1.00×	Lossless
`ulysses4`	`--enforce-eager --usp 4`	~10.3	~1.00×	Lossless
`cache-dit`	`--enforce-eager --cache-backend cache_dit`	7.4 avg	~1.4×	Lossy
`best-combo`	`--enforce-eager --usp 4 --cache-backend cache_dit`	4.7 avg	~2.2×	Lossless + Lossy
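The Speedup column is the baseline inference time divided by the preset's inference time. The sketch below reproduces that arithmetic from the table's own numbers; speedup is a hypothetical helper name.

```shell
# speedup BASELINE_S OPTIMIZED_S
# Ratio of baseline to optimized inference time, printed with two decimals.
# Hypothetical helper for checking the table, not part of the example scripts.
speedup() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

speedup 10.3 7.4   # cache-dit vs. baseline  -> 1.39
speedup 10.3 4.7   # best-combo vs. baseline -> 2.19
```

These round to the ~1.4× and ~2.2× figures shown in the table.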

Observations:

- torch.compile: On H800, warm-request inference time matches the eager baseline (~10.3s). The first request pays ~6s compilation overhead. Benefit depends on model architecture and GPU.
- Ulysses SP (4 GPU): No measurable speedup alone for 41-frame generation at this resolution. Communication overhead outweighs gains at this sequence length.
- Cache-DiT: Inference varies per request (6–10s) due to dynamic caching decisions. Average is ~7.4s (~1.4× speedup) with slight quality tradeoff.
- Best combo: 4-GPU Ulysses SP + Cache-DiT synergize well: Cache-DiT reduces per-step computation, making the communication overhead of Ulysses SP worthwhile. Average ~4.7s (~2.2× speedup).
- FP8 quantization: Reduces VRAM but does not speed up LTX-2 on H800 (compute-bound).

Deployment Recommendations:

- For production with quality priority: use baseline with --enforce-eager
- For maximum throughput (4 GPUs, quality tradeoff): use best-combo (~2.2× speedup)
- For single-GPU throughput: use cache-dit (~1.4× speedup)
- --enforce-eager is recommended to avoid torch.compile warmup latency on first request

Send Requests (curl)

# Using the provided script
bash run_curl_ltx2.sh

# Or directly
curl -sS -X POST http://localhost:8098/v1/videos \
  -H "Accept: application/json" \
  -F "prompt=A serene lakeside sunrise with mist over the water." \
  -F "width=768" \
  -F "height=480" \
  -F "num_frames=41" \
  -F "fps=24" \
  -F "num_inference_steps=20" \
  -F "guidance_scale=3.0" \
  -F "seed=42"

Helios

Helios ships three variants (Helios-Base, Helios-Mid, Helios-Distilled) that share the same server launch. Variant-specific knobs (declared in vllm_omni/model_extras/helios.py) are sent per request through the generic extra_params JSON form field — no per-model server flags required.

Start Server

vllm serve BestWishYsh/Helios-Base --omni --port 8098
# or: MODEL=BestWishYsh/Helios-Mid bash run_server_helios.sh

Send Requests (curl)

# Helios-Base (Stage 1 only)
bash run_curl_helios.sh

# Helios-Mid (Stage 2 pyramid + CFG-Zero*)
PRESET=mid-stage2 MODEL=BestWishYsh/Helios-Mid bash run_curl_helios.sh

# Helios-Distilled (Stage 2 pyramid + DMD, few-step)
PRESET=distilled MODEL=BestWishYsh/Helios-Distilled bash run_curl_helios.sh

The mid-stage2 and distilled presets attach an extra_params field, e.g. for Helios-Distilled:

curl -sS -X POST http://localhost:8098/v1/videos \
  -H "Accept: application/json" \
  -F "prompt=A dynamic time-lapse of scenery rushing past the window of a speeding train." \
  -F "model=BestWishYsh/Helios-Distilled" \
  -F "size=640x384" \
  -F "num_frames=99" \
  -F "fps=16" \
  -F "guidance_scale=1.0" \
  -F "seed=42" \
  -F 'extra_params={"is_enable_stage2": true, "pyramid_num_inference_steps_list": [2, 2, 2], "is_amplify_first_chunk": true}'

Example materials

run_curl_helios.sh

#!/bin/bash
# Helios text-to-video curl example using the async video job API.
#
# Helios-specific knobs (declared in vllm_omni/model_extras/helios.py) are passed
# through the generic `extra_params` JSON form field. Select a variant via PRESET:
#   PRESET=base        -> Helios-Base, Stage 1 only (default)
#   PRESET=mid-stage2  -> Helios-Mid, Stage 2 pyramid + CFG-Zero*
#   PRESET=distilled   -> Helios-Distilled, Stage 2 pyramid + DMD (few-step)

set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:8098}"
MODEL="${MODEL:-BestWishYsh/Helios-Base}"
PROMPT="${PROMPT:-A dynamic time-lapse of scenery rushing past the window of a speeding train.}"
PRESET="${PRESET:-base}"
OUTPUT_PATH="${OUTPUT_PATH:-helios_t2v_${PRESET}.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"

case "${PRESET}" in
  base)
    EXTRA_PARAMS=""
    GUIDANCE_SCALE="5.0"
    ;;
  mid-stage2)
    EXTRA_PARAMS='{"is_enable_stage2": true, "pyramid_num_inference_steps_list": [20, 20, 20], "use_cfg_zero_star": true, "use_zero_init": true, "zero_steps": 1}'
    GUIDANCE_SCALE="5.0"
    ;;
  distilled)
    EXTRA_PARAMS='{"is_enable_stage2": true, "pyramid_num_inference_steps_list": [2, 2, 2], "is_amplify_first_chunk": true}'
    GUIDANCE_SCALE="1.0"
    ;;
  *)
    echo "Unknown PRESET '${PRESET}' (expected base|mid-stage2|distilled)"
    exit 1
    ;;
esac

create_args=(
  -sS -X POST "${BASE_URL}/v1/videos"
  -H "Accept: application/json"
  -F "prompt=${PROMPT}"
  -F "model=${MODEL}"
  -F "size=640x384"
  -F "num_frames=99"
  -F "fps=16"
  -F "num_inference_steps=50"
  -F "guidance_scale=${GUIDANCE_SCALE}"
  -F "seed=42"
)
if [ -n "${EXTRA_PARAMS}" ]; then
  create_args+=(-F "extra_params=${EXTRA_PARAMS}")
fi

create_response=$(curl "${create_args[@]}")

video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
  echo "Failed to create video job:"
  echo "${create_response}" | jq .
  exit 1
fi

echo "Created video job ${video_id} (preset=${PRESET})"
echo "${create_response}" | jq .

while true; do
  status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
  status="$(echo "${status_response}" | jq -r '.status')"

  case "${status}" in
    queued|in_progress)
      echo "Video job ${video_id} status: ${status}"
      sleep "${POLL_INTERVAL}"
      ;;
    completed)
      echo "${status_response}" | jq .
      break
      ;;
    failed)
      echo "Video generation failed:"
      echo "${status_response}" | jq .
      exit 1
      ;;
    *)
      echo "Unexpected status response:"
      echo "${status_response}" | jq .
      exit 1
      ;;
  esac
done

curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"

run_curl_hunyuan_video_15.sh

#!/bin/bash
# HunyuanVideo-1.5 text-to-video curl example using the async video job API.

set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:8098}"
OUTPUT_PATH="${OUTPUT_PATH:-hunyuan_video_15_t2v.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"

create_response=$(
  curl -sS -X POST "${BASE_URL}/v1/videos" \
    -H "Accept: application/json" \
    -F "prompt=A little girl wearing a straw hat runs through a summer meadow full of wildflowers. A wide shot is used, with the camera panning right to follow her." \
    -F "size=832x480" \
    -F "num_frames=33" \
    -F "fps=24" \
    -F "num_inference_steps=30" \
    -F "guidance_scale=6.0" \
    -F "flow_shift=5.0" \
    -F "seed=42"
)

video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
  echo "Failed to create video job:"
  echo "${create_response}" | jq .
  exit 1
fi

echo "Created video job ${video_id}"
echo "${create_response}" | jq .

while true; do
  status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
  status="$(echo "${status_response}" | jq -r '.status')"

  case "${status}" in
    queued|in_progress)
      echo "Video job ${video_id} status: ${status}"
      sleep "${POLL_INTERVAL}"
      ;;
    completed)
      echo "${status_response}" | jq .
      break
      ;;
    failed)
      echo "Video generation failed:"
      echo "${status_response}" | jq .
      exit 1
      ;;
    *)
      echo "Unexpected status response:"
      echo "${status_response}" | jq .
      exit 1
      ;;
  esac
done

curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"

run_curl_ltx2.sh

#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# LTX-2 text-to-video curl example using the async video job API.
# Start the server first: bash run_server_ltx2.sh best-combo

set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:8098}"
OUTPUT_PATH="${OUTPUT_PATH:-ltx2_output.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"

PROMPT="${PROMPT:-A serene lakeside sunrise with mist over the water.}"

create_response=$(
  curl -sS -X POST "${BASE_URL}/v1/videos" \
    -H "Accept: application/json" \
    -F "prompt=${PROMPT}" \
    -F "width=768" \
    -F "height=480" \
    -F "num_frames=41" \
    -F "fps=24" \
    -F "num_inference_steps=20" \
    -F "guidance_scale=3.0" \
    -F "seed=42"
)

video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
  echo "Failed to create video job:"
  echo "${create_response}" | jq .
  exit 1
fi

echo "Created video job ${video_id}"
echo "${create_response}" | jq .

while true; do
  status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
  status="$(echo "${status_response}" | jq -r '.status')"

  case "${status}" in
    queued|in_progress)
      echo "Video job ${video_id} status: ${status}"
      sleep "${POLL_INTERVAL}"
      ;;
    completed)
      echo "${status_response}" | jq .
      break
      ;;
    failed)
      echo "Video generation failed:"
      echo "${status_response}" | jq .
      exit 1
      ;;
    *)
      echo "Unexpected status response:"
      echo "${status_response}" | jq .
      exit 1
      ;;
  esac
done

curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"

run_curl_text_to_video.sh

#!/bin/bash
# Wan2.2 text-to-video curl example using the async video job API.

set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:8098}"
OUTPUT_PATH="${OUTPUT_PATH:-wan22_output.mp4}"
POLL_INTERVAL="${POLL_INTERVAL:-2}"

create_response=$(
  curl -sS -X POST "${BASE_URL}/v1/videos" \
    -H "Accept: application/json" \
    -F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
    -F "seconds=2" \
    -F "size=832x480" \
    -F "negative_prompt=色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" \
    -F "fps=16" \
    -F "num_inference_steps=40" \
    -F "guidance_scale=4.0" \
    -F "guidance_scale_2=4.0" \
    -F "boundary_ratio=0.875" \
    -F "flow_shift=5.0" \
    -F "seed=42"
)

video_id="$(echo "${create_response}" | jq -r '.id')"
if [ -z "${video_id}" ] || [ "${video_id}" = "null" ]; then
  echo "Failed to create video job:"
  echo "${create_response}" | jq .
  exit 1
fi

echo "Created video job ${video_id}"
echo "${create_response}" | jq .

while true; do
  status_response="$(curl -sS "${BASE_URL}/v1/videos/${video_id}")"
  status="$(echo "${status_response}" | jq -r '.status')"

  case "${status}" in
    queued|in_progress)
      echo "Video job ${video_id} status: ${status}"
      sleep "${POLL_INTERVAL}"
      ;;
    completed)
      echo "${status_response}" | jq .
      break
      ;;
    failed)
      echo "Video generation failed:"
      echo "${status_response}" | jq .
      exit 1
      ;;
    *)
      echo "Unexpected status response:"
      echo "${status_response}" | jq .
      exit 1
      ;;
  esac
done

curl -sS -L "${BASE_URL}/v1/videos/${video_id}/content" -o "${OUTPUT_PATH}"
echo "Saved video to ${OUTPUT_PATH}"

run_server.sh

#!/bin/bash
# Wan2.2 online serving startup script

MODEL="${MODEL:-Wan-AI/Wan2.2-T2V-A14B-Diffusers}"
PORT="${PORT:-8098}"
BOUNDARY_RATIO="${BOUNDARY_RATIO:-0.875}"
FLOW_SHIFT="${FLOW_SHIFT:-5.0}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
ENABLE_CACHE_DIT_SUMMARY="${ENABLE_CACHE_DIT_SUMMARY:-0}"

echo "Starting Wan2.2 server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Boundary ratio: $BOUNDARY_RATIO"
echo "Flow shift: $FLOW_SHIFT"
echo "Cache backend: $CACHE_BACKEND"
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
    echo "Cache-DiT summary: enabled"
fi

# Collect flags in an array so optional ones are added without relying on
# unquoted word splitting.
SERVE_ARGS=(--port "$PORT" --boundary-ratio "$BOUNDARY_RATIO" --flow-shift "$FLOW_SHIFT")
if [ "$CACHE_BACKEND" != "none" ]; then
    SERVE_ARGS+=(--cache-backend "$CACHE_BACKEND")
fi
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
    SERVE_ARGS+=(--enable-cache-dit-summary)
fi

vllm serve "$MODEL" --omni "${SERVE_ARGS[@]}"

run_server_helios.sh

#!/bin/bash
# Helios online serving startup script.
# All three variants (Helios-Base / Helios-Mid / Helios-Distilled) share the same
# server launch; variant-specific knobs are sent per-request via `extra_params`
# (see run_curl_helios.sh).

MODEL="${MODEL:-BestWishYsh/Helios-Base}"
PORT="${PORT:-8098}"

echo "Starting Helios server..."
echo "Model: $MODEL"
echo "Port: $PORT"

vllm serve "$MODEL" --omni \
    --port "$PORT"

run_server_hunyuan_video_15.sh

#!/bin/bash
# HunyuanVideo-1.5 text-to-video online serving startup script
#
# 480p: ~35 GB VRAM (BF16), fits 1x A100 80GB
# 720p: needs FP8 + VAE tiling, ~35 GB VRAM

MODEL="${MODEL:-hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v}"
PORT="${PORT:-8098}"
FLOW_SHIFT="${FLOW_SHIFT:-5.0}"
QUANTIZATION="${QUANTIZATION:-}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"

echo "Starting HunyuanVideo-1.5 T2V server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Flow shift: $FLOW_SHIFT"
echo "Quantization: ${QUANTIZATION:-none}"
echo "Cache backend: $CACHE_BACKEND"

# Collect flags in an array so optional ones are added without relying on
# unquoted word splitting.
SERVE_ARGS=(--port "$PORT" --flow-shift "$FLOW_SHIFT")
if [ -n "$QUANTIZATION" ]; then
    SERVE_ARGS+=(--quantization "$QUANTIZATION")
fi
if [ "$CACHE_BACKEND" != "none" ]; then
    SERVE_ARGS+=(--cache-backend "$CACHE_BACKEND")
fi

vllm serve "$MODEL" --omni "${SERVE_ARGS[@]}"

run_server_ltx2.sh

#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# LTX-2 online serving startup script with optimization presets.
#
# Usage:
#   bash run_server_ltx2.sh                  # baseline (1 GPU, eager)
#   bash run_server_ltx2.sh ulysses4         # 4-GPU Ulysses SP
#   bash run_server_ltx2.sh cache-dit        # 1 GPU + Cache-DiT
#   bash run_server_ltx2.sh best-combo       # 4-GPU Ulysses SP + Cache-DiT
#
# Online serving benchmarks on H800 (480×768, 41 frames, 20 steps):
#   baseline    : 10.3s inference (1.00×)
#   compile     : ~10.3s warm     (~1.00×) first request +6s warmup
#   ulysses4    : ~10.3s          (~1.00×) no gain at 41 frames
#   cache-dit   :  7.4s avg       (~1.4×)  lossy, variable per request
#   best-combo  :  4.7s avg       (~2.2×)  4-GPU ulysses + cache-dit

set -euo pipefail

MODEL="${MODEL:-Lightricks/LTX-2}"
PORT="${PORT:-8098}"
FLOW_SHIFT="${FLOW_SHIFT:-1.0}"
BOUNDARY_RATIO="${BOUNDARY_RATIO:-1.0}"

PRESET="${1:-baseline}"

EXTRA_ARGS=()
case "$PRESET" in
    baseline)
        echo "=== LTX-2 Preset: baseline (1 GPU, enforce-eager) ==="
        EXTRA_ARGS+=(--enforce-eager)
        ;;
    ulysses2)
        echo "=== LTX-2 Preset: 2-GPU Ulysses SP (lossless) ==="
        EXTRA_ARGS+=(--enforce-eager --usp 2)
        ;;
    ulysses4)
        echo "=== LTX-2 Preset: 4-GPU Ulysses SP (lossless) ==="
        EXTRA_ARGS+=(--enforce-eager --usp 4)
        ;;
    cache-dit)
        echo "=== LTX-2 Preset: Cache-DiT (1 GPU, lossy) ==="
        EXTRA_ARGS+=(--enforce-eager --cache-backend cache_dit)
        ;;
    best-combo)
        echo "=== LTX-2 Preset: 4-GPU Ulysses SP + Cache-DiT (best combo) ==="
        EXTRA_ARGS+=(--enforce-eager --usp 4 --cache-backend cache_dit)
        ;;
    compile)
        echo "=== LTX-2 Preset: torch.compile (1 GPU, lossless) ==="
        # torch.compile is the default (no --enforce-eager)
        ;;
    *)
        echo "Usage: $0 {baseline|ulysses2|ulysses4|cache-dit|best-combo|compile}"
        echo ""
        echo "Presets:"
        echo "  baseline    - 1 GPU, eager execution (reference)"
        echo "  ulysses2    - 2-GPU Ulysses SP (lossless)"
        echo "  ulysses4    - 4-GPU Ulysses SP (lossless)"
        echo "  cache-dit   - 1 GPU + Cache-DiT (lossy, ~1.4× speedup)"
        echo "  best-combo  - 4-GPU Ulysses SP + Cache-DiT (~2.2× speedup)"
        echo "  compile     - 1 GPU + torch.compile (slower first request)"
        echo ""
        echo "Environment variables:"
        echo "  MODEL           - Model path (default: Lightricks/LTX-2)"
        echo "  PORT            - Server port (default: 8098)"
        echo "  FLOW_SHIFT      - Scheduler flow shift (default: 1.0)"
        echo "  BOUNDARY_RATIO  - Boundary ratio (default: 1.0)"
        exit 1
        ;;
esac

echo "Model: $MODEL"
echo "Port: $PORT"
echo "Flow shift: $FLOW_SHIFT"
echo "Boundary ratio: $BOUNDARY_RATIO"

vllm serve "$MODEL" --omni \
    --port "$PORT" \
    --flow-shift "$FLOW_SHIFT" \
    --boundary-ratio "$BOUNDARY_RATIO" \
    "${EXTRA_ARGS[@]}"