Speech-To-Video¶
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/speech_to_video.
This example demonstrates how to deploy the Wan2.2 speech-to-video (S2V) model for online video generation using vLLM-Omni.
Supported Models¶
| Model | Model ID |
|---|---|
| Wan2.2 S2V (14B) | Wan-AI/Wan2.2-S2V-14B |
Start Server¶
Basic Start¶
VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Wan-AI/Wan2.2-S2V-14B --omni \
--model-class-name WanS2VPipeline \
--tensor-parallel-size 2 \
--flow-shift 3.0 \
--vae-use-slicing --vae-use-tiling \
--cache-backend cache_dit \
--port 8091
Start with Script¶
The script allows overriding: - MODEL (default: Wan-AI/Wan2.2-S2V-14B) - PORT (default: 8091) - FLOW_SHIFT (default: 3.0) - TP (default: 2) - CACHE_BACKEND (default: cache_dit)
Start with 4 GPUs¶
Sync API (Recommended for Testing)¶
POST /v1/videos/sync blocks until generation completes and returns the raw video bytes (video/mp4) directly in the response body.
Async Job API¶
POST /v1/videos creates an asynchronous job and returns immediately.
no_proxy=127.0.0.1 \
create_response=$(curl -sS -X POST http://127.0.0.1:8091/v1/videos \
-H "Accept: application/json" \
-F "prompt=A person singing" \
-F 'image_reference={"image_url": "https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png"}' \
-F 'audio_reference={"audio_url": "https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3"}' \
-F "width=832" -F "height=480" \
-F "num_inference_steps=40" \
-F "guidance_scale=4.5" \
-F "fps=16")
video_id=$(echo "$create_response" | jq -r '.id')
# Poll until complete
while true; do
status=$(curl -s "http://127.0.0.1:8091/v1/videos/${video_id}" | jq -r '.status')
if [ "$status" = "completed" ]; then
break
fi
if [ "$status" = "failed" ]; then
echo "Video generation failed"
exit 1
fi
sleep 2
done
# Download
curl -L "http://127.0.0.1:8091/v1/videos/${video_id}/content" -o s2v_output.mp4
Request Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
prompt | str | - | Text description of the desired video |
image_reference | JSON str | - | Image reference: {"image_url": "..."} — supports HTTP(s) URLs or base64 data URLs |
audio_reference | JSON str | - | Audio reference: {"audio_url": "..."} — supports HTTP(s) URLs or base64 data URLs |
width | int | None | Video width in pixels |
height | int | None | Video height in pixels |
fps | int | None | Output video frame rate |
num_inference_steps | int | None | Number of denoising steps (40 recommended) |
guidance_scale | float | None | CFG guidance scale |
seed | int | None | Random seed for reproducibility |
Notes¶
- S2V requires both a reference image (
image_reference) and an audio reference (audio_reference). The generated video will show a person matching the reference image with lip movements synchronized to the audio. audio_referenceaccepts a JSON string:{"audio_url": "..."}where the URL can be:- An HTTP/HTTPS URL (e.g.,
https://example.com/audio.mp3) - A base64 data URL (e.g.,
data:audio/mp3;base64,...) --model-class-name WanS2VPipelineis required on the server to select the S2V pipeline (distinct from the T2V/I2V pipelines).--cache-backend cache_ditenables DiT caching for ~2x speedup on cached steps.- Audio is muxed into the output MP4 automatically.
- Always pass
fps=16to ensure correct video/audio alignment.
Example materials¶
run_curl_speech_to_video.sh
#!/bin/bash
# Wan2.2 S2V (speech-to-video) curl example using the sync video API.
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:8091}"
OUTPUT_PATH="${OUTPUT_PATH:-s2v_480p_serve.mp4}"
IMAGE_URL="${IMAGE_URL:-https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png}"
AUDIO_URL="${AUDIO_URL:-https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3}"
PROMPT="${PROMPT:-A person singing}"
WIDTH="${WIDTH:-832}"
HEIGHT="${HEIGHT:-480}"
NUM_INFERENCE_STEPS="${NUM_INFERENCE_STEPS:-40}"
GUIDANCE_SCALE="${GUIDANCE_SCALE:-4.5}"
FPS="${FPS:-16}"
echo "Sending S2V request..."
echo " Image URL: $IMAGE_URL"
echo " Audio URL: $AUDIO_URL"
echo " Prompt: $PROMPT"
echo " Resolution: ${WIDTH}x${HEIGHT}"
echo " Steps: $NUM_INFERENCE_STEPS"
echo " FPS: $FPS"
IMAGE_REF_JSON="{\"image_url\": \"${IMAGE_URL}\"}"
AUDIO_REF_JSON="{\"audio_url\": \"${AUDIO_URL}\"}"
no_proxy=127.0.0.1 \
curl -X POST "${BASE_URL}/v1/videos/sync" \
-F "prompt=${PROMPT}" \
-F "image_reference=${IMAGE_REF_JSON}" \
-F "audio_reference=${AUDIO_REF_JSON}" \
-F "width=${WIDTH}" -F "height=${HEIGHT}" \
-F "num_inference_steps=${NUM_INFERENCE_STEPS}" \
-F "guidance_scale=${GUIDANCE_SCALE}" \
-F "fps=${FPS}" \
--output "${OUTPUT_PATH}"
if [ -f "$OUTPUT_PATH" ] && [ -s "$OUTPUT_PATH" ]; then
echo "Saved video to ${OUTPUT_PATH} ($(du -h "$OUTPUT_PATH" | cut -f1))"
else
echo "ERROR: Output file is empty or missing"
exit 1
fi
run_server.sh
#!/bin/bash
# Wan2.2 S2V (speech-to-video) online serving startup script
MODEL="${MODEL:-Wan-AI/Wan2.2-S2V-14B}"
PORT="${PORT:-8091}"
FLOW_SHIFT="${FLOW_SHIFT:-3.0}"
TP="${TP:-2}"
CACHE_BACKEND="${CACHE_BACKEND:-cache_dit}"
echo "Starting Wan2.2 S2V server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Flow shift: $FLOW_SHIFT"
echo "Tensor parallel size: $TP"
echo "Cache backend: $CACHE_BACKEND"
CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi
VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve "$MODEL" --omni \
--model-class-name WanS2VPipeline \
--tensor-parallel-size "$TP" \
--flow-shift "$FLOW_SHIFT" \
--vae-use-slicing --vae-use-tiling \
--port "$PORT" \
$CACHE_BACKEND_FLAG