Speech-To-Video¶

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/speech_to_video.

This example demonstrates how to deploy the Wan2.2 speech-to-video (S2V) model for online video generation using vLLM-Omni.

Supported Models¶

Model	Model ID
Wan2.2 S2V (14B)	`Wan-AI/Wan2.2-S2V-14B`

Start Server¶

Basic Start¶

VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Wan-AI/Wan2.2-S2V-14B --omni \
  --model-class-name WanS2VPipeline \
  --tensor-parallel-size 2 \
  --flow-shift 3.0 \
  --vae-use-slicing --vae-use-tiling \
  --cache-backend cache_dit \
  --port 8091

Start with Script¶

bash run_server.sh

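The script reads the following environment variables, each of which can be overridden:
- MODEL (default: Wan-AI/Wan2.2-S2V-14B)
- PORT (default: 8091)
- FLOW_SHIFT (default: 3.0)
- TP (default: 2), the tensor parallel size
- CACHE_BACKEND (default: cache_dit); set it to none to omit the --cache-backend flag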

Start with 4 GPUs¶

TP=4 bash run_server.sh

Sync API (Recommended for Testing)¶

POST /v1/videos/sync blocks until generation completes and returns the raw video bytes (video/mp4) directly in the response body.

bash run_curl_speech_to_video.sh
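When curl is invoked without `--fail`, an HTTP error body is written to the output file just like a video would be, so a non-empty file is not proof of success. A minimal sketch of a sanity check follows. It assumes the returned file is a standard MP4 whose first box is `ftyp`; the helper name `is_mp4` is hypothetical and not part of the example scripts.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if the file looks like an MP4 container.
# Assumption: the file starts with an ISO BMFF "ftyp" box,
# so the four bytes at offset 4 spell "ftyp".
is_mp4() {
    local file="$1"
    local box_type
    # Read 4 bytes starting at offset 4 (the type field of the first box).
    box_type=$(dd if="$file" bs=1 skip=4 count=4 2>/dev/null) || return 1
    [ "$box_type" = "ftyp" ]
}
```

For example, after running the script with its default output path, `is_mp4 s2v_480p_serve.mp4 || echo "Response is not an MP4"` would flag a saved error body.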

Async Job API¶

POST /v1/videos creates an asynchronous job and returns immediately.

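# Export no_proxy so that every curl call below bypasses any configured proxy.
# (A bare "no_proxy=... var=$(curl ...)" only sets a shell variable that curl never sees.)
export no_proxy=127.0.0.1
create_response=$(curl -sS -X POST http://127.0.0.1:8091/v1/videos \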
  -H "Accept: application/json" \
  -F "prompt=A person singing" \
  -F 'image_reference={"image_url": "https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png"}' \
  -F 'audio_reference={"audio_url": "https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3"}' \
  -F "width=832" -F "height=480" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.5" \
  -F "fps=16")

video_id=$(echo "$create_response" | jq -r '.id')

# Poll until complete
while true; do
  status=$(curl -s "http://127.0.0.1:8091/v1/videos/${video_id}" | jq -r '.status')
  if [ "$status" = "completed" ]; then
    break
  fi
  if [ "$status" = "failed" ]; then
    echo "Video generation failed"
    exit 1
  fi
  sleep 2
done

# Download
curl -L "http://127.0.0.1:8091/v1/videos/${video_id}/content" -o s2v_output.mp4
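The polling loop above waits indefinitely if the job never reaches `completed` or `failed`. A minimal sketch of a bounded variant follows. The function names `fetch_status` and `wait_for_video`, the `BASE_URL` variable, and the default attempt limit are illustrative assumptions, not part of the example scripts.

```shell
#!/bin/bash
BASE_URL="${BASE_URL:-http://127.0.0.1:8091}"

# Print the job's current status string (same request as the polling loop).
fetch_status() {
    curl -sS "${BASE_URL}/v1/videos/$1" | jq -r '.status'
}

# Poll until the job completes, fails, or the attempt limit is reached.
# Returns 0 on "completed", 1 on "failed", 2 on timeout.
wait_for_video() {
    local video_id="$1"
    local max_attempts="${2:-300}"   # assumed limit, adjust to your workload
    local interval="${3:-2}"         # seconds between polls
    local attempt=0 status

    while [ "$attempt" -lt "$max_attempts" ]; do
        # Treat a failed status request as "not ready yet" and keep polling.
        status=$(fetch_status "$video_id") || status=""
        case "$status" in
            completed) return 0 ;;
            failed)
                echo "Video generation failed" >&2
                return 1
                ;;
        esac
        attempt=$((attempt + 1))
        sleep "$interval"
    done
    echo "Timed out waiting for video ${video_id}" >&2
    return 2
}
```

A caller could then chain the download onto a successful wait, for example `wait_for_video "$video_id" && curl -L "${BASE_URL}/v1/videos/${video_id}/content" -o s2v_output.mp4`.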

Request Parameters¶

Parameter	Type	Default	Description
`prompt`	str	-	Text description of the desired video
`image_reference`	JSON str	-	Image reference: `{"image_url": "..."}` — supports HTTP(s) URLs or base64 data URLs
`audio_reference`	JSON str	-	Audio reference: `{"audio_url": "..."}` — supports HTTP(s) URLs or base64 data URLs
`width`	int	None	Video width in pixels
`height`	int	None	Video height in pixels
`fps`	int	None	Output video frame rate
`num_inference_steps`	int	None	Number of denoising steps (40 recommended)
`guidance_scale`	float	None	CFG guidance scale
`seed`	int	None	Random seed for reproducibility

Notes¶

- S2V requires both a reference image (image_reference) and an audio reference (audio_reference). The generated video shows a person matching the reference image, with lip movements synchronized to the audio.
- audio_reference accepts a JSON string: {"audio_url": "..."} where the URL can be:
  - An HTTP/HTTPS URL (e.g., https://example.com/audio.mp3)
  - A base64 data URL (e.g., data:audio/mp3;base64,...)
- --model-class-name WanS2VPipeline is required on the server to select the S2V pipeline (distinct from the T2V/I2V pipelines).
- --cache-backend cache_dit enables DiT caching for ~2x speedup on cached steps.
- Audio is muxed into the output MP4 automatically.
- Always pass fps=16 to ensure correct video/audio alignment.
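For local files, the base64 data URL form described above can be built in the shell. A minimal sketch, assuming a `base64` utility is available on the PATH; the helper names `to_data_url` and `reference_json` and the MIME type you pass are illustrative and not part of the example scripts.

```shell
#!/bin/bash
# Hypothetical helper: print a base64 data URL for a local file.
# Usage: to_data_url <file> <mime-type>
to_data_url() {
    local file="$1" mime="$2"
    # Strip newlines because some base64 implementations wrap their output.
    printf 'data:%s;base64,%s' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Hypothetical helper: wrap a URL in the JSON shape used by the form fields.
# Usage: reference_json <key> <url>, e.g. reference_json audio_url "$url"
reference_json() {
    printf '{"%s": "%s"}' "$1" "$2"
}
```

Encoded media can easily exceed command-line length limits, so one option is to write the JSON to a file, for example `reference_json audio_url "$(to_data_url speech.mp3 audio/mp3)" > audio_ref.json`, and pass it with `-F "audio_reference=<audio_ref.json"`. The `<` prefix tells curl to read the field value from the file. The file names here are placeholders.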

Example materials¶

run_curl_speech_to_video.sh

#!/bin/bash
# Wan2.2 S2V (speech-to-video) curl example using the sync video API.

set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:8091}"
OUTPUT_PATH="${OUTPUT_PATH:-s2v_480p_serve.mp4}"
IMAGE_URL="${IMAGE_URL:-https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png}"
AUDIO_URL="${AUDIO_URL:-https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3}"
PROMPT="${PROMPT:-A person singing}"
WIDTH="${WIDTH:-832}"
HEIGHT="${HEIGHT:-480}"
NUM_INFERENCE_STEPS="${NUM_INFERENCE_STEPS:-40}"
GUIDANCE_SCALE="${GUIDANCE_SCALE:-4.5}"
FPS="${FPS:-16}"

echo "Sending S2V request..."
echo "  Image URL: $IMAGE_URL"
echo "  Audio URL: $AUDIO_URL"
echo "  Prompt: $PROMPT"
echo "  Resolution: ${WIDTH}x${HEIGHT}"
echo "  Steps: $NUM_INFERENCE_STEPS"
echo "  FPS: $FPS"

IMAGE_REF_JSON="{\"image_url\": \"${IMAGE_URL}\"}"
AUDIO_REF_JSON="{\"audio_url\": \"${AUDIO_URL}\"}"

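# --fail makes curl exit non-zero on HTTP errors instead of saving the error body as the video.
# no_proxy lists both loopback names because BASE_URL defaults to localhost.
no_proxy=127.0.0.1,localhost \
curl --fail -X POST "${BASE_URL}/v1/videos/sync" \
  -F "prompt=${PROMPT}" \
  -F "image_reference=${IMAGE_REF_JSON}" \
  -F "audio_reference=${AUDIO_REF_JSON}" \
  -F "width=${WIDTH}" -F "height=${HEIGHT}" \
  -F "num_inference_steps=${NUM_INFERENCE_STEPS}" \
  -F "guidance_scale=${GUIDANCE_SCALE}" \
  -F "fps=${FPS}" \
  --output "${OUTPUT_PATH}"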

if [ -f "$OUTPUT_PATH" ] && [ -s "$OUTPUT_PATH" ]; then
    echo "Saved video to ${OUTPUT_PATH} ($(du -h "$OUTPUT_PATH" | cut -f1))"
else
    echo "ERROR: Output file is empty or missing"
    exit 1
fi

run_server.sh

#!/bin/bash
# Wan2.2 S2V (speech-to-video) online serving startup script

MODEL="${MODEL:-Wan-AI/Wan2.2-S2V-14B}"
PORT="${PORT:-8091}"
FLOW_SHIFT="${FLOW_SHIFT:-3.0}"
TP="${TP:-2}"
CACHE_BACKEND="${CACHE_BACKEND:-cache_dit}"

echo "Starting Wan2.2 S2V server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Flow shift: $FLOW_SHIFT"
echo "Tensor parallel size: $TP"
echo "Cache backend: $CACHE_BACKEND"

CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
    CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi

VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve "$MODEL" --omni \
    --model-class-name WanS2VPipeline \
    --tensor-parallel-size "$TP" \
    --flow-shift "$FLOW_SHIFT" \
    --vae-use-slicing --vae-use-tiling \
    --port "$PORT" \
    $CACHE_BACKEND_FLAG