Image-To-Video

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/image_to_video

This example demonstrates how to generate videos from images using Wan2.2 Image-to-Video models with vLLM-Omni's offline inference API.

Local CLI Usage

Download the example image:

wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg

Wan2.2-I2V-A14B-Diffusers (MoE)

python image_to_video.py \
  --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
  --image cherry_blossom.jpg \
  --prompt "Cherry blossoms swaying gently in the breeze, petals falling, smooth motion" \
  --negative-prompt "<optional quality filter>" \
  --height 480 \
  --width 832 \
  --num-frames 48 \
  --guidance-scale 5.0 \
  --guidance-scale-high 6.0 \
  --num-inference-steps 40 \
  --boundary-ratio 0.875 \
  --flow-shift 12.0 \
  --fps 16 \
  --output i2v_output.mp4

Wan2.2-TI2V-5B-Diffusers (Unified)

python image_to_video.py \
  --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --image cherry_blossom.jpg \
  --prompt "Cherry blossoms swaying gently in the breeze, petals falling, smooth motion" \
  --negative-prompt "<optional quality filter>" \
  --height 480 \
  --width 832 \
  --num-frames 48 \
  --guidance-scale 4.0 \
  --num-inference-steps 40 \
  --flow-shift 12.0 \
  --fps 16 \
  --output i2v_output.mp4
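
The two commands above differ only in the model ID, the guidance scale, and the two MoE-specific flags (--guidance-scale-high and --boundary-ratio), which the TI2V-5B command omits. As a rough sketch, the argument lists could be assembled from Python like this. The helper name build_i2v_command is hypothetical and not part of the example script; only the flags and values shown above are used.

```python
import shlex


def build_i2v_command(model, image, prompt, *, guidance_scale,
                      guidance_scale_high=None, boundary_ratio=None,
                      output="i2v_output.mp4"):
    """Assemble an argv list for image_to_video.py (hypothetical helper)."""
    cmd = [
        "python", "image_to_video.py",
        "--model", model,
        "--image", image,
        "--prompt", prompt,
        "--height", "480",
        "--width", "832",
        "--num-frames", "48",
        "--guidance-scale", str(guidance_scale),
        "--num-inference-steps", "40",
        "--flow-shift", "12.0",
        "--fps", "16",
        "--output", output,
    ]
    # Only the MoE command above passes these two flags, so add them on request.
    if guidance_scale_high is not None:
        cmd += ["--guidance-scale-high", str(guidance_scale_high)]
    if boundary_ratio is not None:
        cmd += ["--boundary-ratio", str(boundary_ratio)]
    return cmd


PROMPT = "Cherry blossoms swaying gently in the breeze, petals falling, smooth motion"

moe_cmd = build_i2v_command(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers", "cherry_blossom.jpg", PROMPT,
    guidance_scale=5.0, guidance_scale_high=6.0, boundary_ratio=0.875,
)
unified_cmd = build_i2v_command(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers", "cherry_blossom.jpg", PROMPT,
    guidance_scale=4.0,
)

# Print shell-quoted commands for inspection before running them.
print(shlex.join(moe_cmd))
print(shlex.join(unified_cmd))
```

Either list can be passed to subprocess.run(cmd, check=True) from the example directory to launch generation.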

Key arguments:

  • --model: Model ID (I2V-A14B for MoE, TI2V-5B for unified T2V+I2V).
  • --image: Path to input image (required).
  • --prompt: Text description of desired motion/animation.
  • --height/--width: Output resolution (auto-calculated from image if not set). Dimensions should be multiples of 16.
  • --num-frames: Number of frames (default 81).
  • --guidance-scale and --guidance-scale-high: CFG scales. For MoE models, --guidance-scale applies to the low-noise stage and --guidance-scale-high to the high-noise stage.
  • --negative-prompt: Optional list of artifacts to suppress.
  • --boundary-ratio: Boundary split ratio for two-stage MoE models.
  • --flow-shift: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
  • --sample-solver: Wan2.2 sampling solver. Use unipc for the default multistep solver, or euler for Lightning/Distill checkpoints.
  • --num-inference-steps: Number of denoising steps (default 50).
  • --fps: Frames per second for the saved MP4 (requires export_to_video from diffusers).
  • --output: Path to save the generated video.
  • --vae-use-slicing: Enable VAE slicing for memory optimization.
  • --vae-use-tiling: Enable VAE tiling for memory optimization.
  • --cfg-parallel-size: Set to 2 to enable CFG parallelism. See the user guide for more examples.
  • --tensor-parallel-size: Tensor parallel size (only effective for models that support TP, e.g. LTX2).
  • --enable-cpu-offload: Enable CPU offloading for diffusion models.
  • --use-hsdp: Enable Hybrid Sharded Data Parallel to shard model weights across GPUs.
  • --hsdp-shard-size: Number of GPUs to shard model weights across within each replica group. -1 (default) auto-calculates as world_size / replicate_size.
  • --hsdp-replicate-size: Number of replica groups for HSDP. Each replica holds a full sharded copy. Default 1 means pure sharding (no replication).
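
The relationship between the two HSDP sizes is simple arithmetic. The sketch below illustrates the documented default (-1 resolves to world_size / replicate_size); the function name resolve_hsdp_shard_size and the divisibility check are assumptions made for illustration, not the library's actual implementation.

```python
def resolve_hsdp_shard_size(world_size: int, replicate_size: int = 1,
                            shard_size: int = -1) -> int:
    """Return the number of GPUs each replica group shards weights across."""
    if replicate_size < 1:
        raise ValueError("replicate_size must be at least 1")
    if shard_size == -1:
        # Documented default: auto-calculate as world_size / replicate_size.
        # Assumption: the GPUs must divide evenly among the replica groups.
        if world_size % replicate_size != 0:
            raise ValueError("world_size must be divisible by replicate_size")
        return world_size // replicate_size
    # An explicit value is passed through unchanged.
    return shard_size


# 4 GPUs, default replicate size of 1: pure sharding across all 4 GPUs.
print(resolve_hsdp_shard_size(world_size=4))
# 8 GPUs split into 2 replica groups: each group shards across 4 GPUs.
print(resolve_hsdp_shard_size(world_size=8, replicate_size=2))
```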

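When setting --height and --width by hand, both values should be multiples of 16. One way to pick them is to keep the input image's aspect ratio within a target pixel budget and round down to the nearest multiple of 16. This is only a sketch: the helper fit_resolution and the 480 × 832 pixel budget are assumptions for illustration and may not match how the script auto-calculates the resolution.

```python
import math


def fit_resolution(image_width: int, image_height: int,
                   max_area: int = 480 * 832, multiple: int = 16):
    """Return (height, width) near max_area pixels, each a multiple of `multiple`."""
    aspect = image_height / image_width
    # Solve height * width = max_area subject to height / width = aspect.
    height = round(math.sqrt(max_area * aspect))
    width = round(math.sqrt(max_area / aspect))
    # Round down so both dimensions are multiples of `multiple`.
    return (height // multiple * multiple, width // multiple * multiple)


# A 1280x720 input image maps to --height 464 --width 832.
print(fit_resolution(1280, 720))
```
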
ℹ️ If you encounter out-of-memory (OOM) errors, try --vae-use-slicing and --vae-use-tiling to reduce memory usage.

For Wan2.2 LightX2V-converted local Diffusers directories and related LoRA assets, see the LoRA guide.

Example materials

image_to_video.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/image_to_video/image_to_video.py.