VACE Video Generation¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/vace.

Generate videos from text prompts, images, or video conditions using vLLM-Omni's VACE diffusion pipeline.

vace_video_generation.py: command-line script for multi-mode video generation with advanced options.

Overview¶

VACE (Video All-in-one Creation Engine) supports multiple video tasks through a single unified model, including text-to-video, image-to-video, first-last-frame interpolation, inpainting, and reference image-guided generation.

Supported Models¶

Model	Architecture	Peak VRAM (GiB) *	Model Weights (GiB)	HuggingFace
Wan2.1-VACE (1.3B)	Wan2.1	TBD	~10	Wan-AI/Wan2.1-VACE-1.3B-diffusers
Wan2.1-VACE (14B)	Wan2.1	TBD	~38	Wan-AI/Wan2.1-VACE-14B-diffusers

Info

*Peak VRAM: based on basic single-card usage, 480x832 resolution, 81 frames, 30 inference steps, without any acceleration/optimization features.

Default model: Wan-AI/Wan2.1-VACE-14B-diffusers

Quick Start¶

Python API¶

Text-to-video generation:

import torch

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

if __name__ == "__main__":
    omni = Omni(model="Wan-AI/Wan2.1-VACE-1.3B-diffusers")
    prompt = "A sleek robot stands in a vast warehouse filled with boxes"
    outputs = omni.generate(
        prompt,
        OmniDiffusionSamplingParams(
            height=480,
            width=832,
            num_frames=81,
            num_inference_steps=30,
            guidance_scale=5.0,
        ),
    )
    video = outputs[0].images

    from diffusers.utils import export_to_video
    export_to_video(list(video[0]), "t2v_output.mp4", fps=16)
    omni.close()

Local CLI Usage¶

python vace_video_generation.py \
  --mode t2v \
  --prompt "A sleek robot stands in a vast warehouse filled with boxes" \
  --height 480 --width 832 --num-frames 81 \
  --output t2v_output.mp4

Key Arguments¶

Argument	Type	Default	Description
`--mode`	str	`"t2v"`	VACE task mode: `t2v`, `i2v`, `v2lf`, `flf2v`, `inpaint`, `r2v`
`--model`	str	`"Wan-AI/Wan2.1-VACE-14B-diffusers"`	Model ID or local path
`--prompt`	str	`"A cat walking in a garden"`	Text description of desired video
`--negative-prompt`	str	`""`	Negative prompt for classifier-free guidance
`--image`	str	`None`	Input image path (for I2V, R2V, FLF2V, inpaint modes)
`--last-image`	str	`None`	Last frame image path (for FLF2V mode)
`--height`	int	`480`	Output video height in pixels (should be a multiple of 16)
`--width`	int	`832`	Output video width in pixels (should be a multiple of 16)
`--num-frames`	int	`81`	Number of video frames to generate
`--num-inference-steps`	int	`30`	Number of denoising steps (more steps = higher quality, slower)
`--guidance-scale`	float	`5.0`	Classifier-free guidance scale
`--flow-shift`	float	`5.0`	Scheduler flow shift parameter
`--seed`	int	`42`	Random seed for deterministic sampling
`--fps`	int	`16`	Frames per second for the saved MP4
`--output`	str	`"vace_output.mp4"`	Path to save the generated video
`--vae-use-tiling`	flag	on	Enable VAE tiling for memory optimization
`--cfg-parallel-size`	int	`1`	Set to `2` to enable CFG Parallel
`--ulysses-degree`	int	`1`	Ulysses sequence parallel degree for multi-GPU inference
`--ring-degree`	int	`1`	Ring sequence parallel degree for hybrid Ulysses + Ring inference
`--tensor-parallel-size`	int	—	Tensor parallel size
`--enforce-eager`	flag	off	Disable torch.compile

If you encounter OOM errors, try --vae-use-tiling or multi-GPU parallelism options (--ulysses-degree, --cfg-parallel-size).

More CLI Examples¶

Image-to-Video (I2V)¶

First frame is kept, remaining frames are generated:

python vace_video_generation.py \
  --mode i2v \
  --image astronaut.jpg \
  --prompt "An astronaut emerging from a cracked egg on the moon" \
  --height 480 --width 832 --num-frames 81 \
  --output i2v_output.mp4

First-Last-Frame Interpolation (FLF2V)¶

python vace_video_generation.py \
  --mode flf2v \
  --image first_frame.jpg --last-image last_frame.jpg \
  --prompt "A bird takes off from a branch and lands on another" \
  --height 512 --width 512 --num-frames 81 \
  --output flf2v_output.mp4

Inpainting¶

Center vertical stripe is masked and regenerated:

python vace_video_generation.py \
  --mode inpaint \
  --image scene.jpg \
  --prompt "Shrek walks out of a building" \
  --height 480 --width 832 --num-frames 81 \
  --output inpaint_output.mp4

Reference Image-guided (R2V)¶

python vace_video_generation.py \
  --mode r2v \
  --image reference.jpg \
  --prompt "Camera slowly zooms out from the character" \
  --height 480 --width 832 --num-frames 81 \
  --output r2v_output.mp4

Example materials¶

vace_video_generation.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""VACE video generation example.

VACE (Video All-in-one Creation Engine) supports multiple video tasks:
  - T2V:        Text-to-Video
  - I2V:        Image-to-Video (first frame conditioning)
  - V2LF:       Video-to-Last-Frame
  - FLF2V:      First-Last-Frame interpolation
  - Inpainting:  Masked region generation
  - R2V:        Reference image-guided generation

Usage examples:
  # T2V (text-to-video)
  python vace_video_generation.py --mode t2v --prompt "A robot in a warehouse"

  # I2V (image-to-video, first frame kept)
  python vace_video_generation.py --mode i2v --image input.jpg --prompt "..."

  # FLF2V (first-last frame interpolation)
  python vace_video_generation.py --mode flf2v --image first.jpg --last-image last.jpg

  # R2V (reference image guided)
  python vace_video_generation.py --mode r2v --image ref.jpg --prompt "..."
"""

import argparse
import time
from pathlib import Path

import numpy as np
import PIL.Image
import torch

from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.platforms import current_omni_platform


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="VACE video generation.")
    parser.add_argument(
        "--model",
        default="Wan-AI/Wan2.1-VACE-14B-diffusers",
        help="VACE model ID or local path.",
    )
    parser.add_argument(
        "--mode",
        default="t2v",
        choices=["t2v", "i2v", "v2lf", "flf2v", "inpaint", "r2v"],
        help="Generation mode.",
    )
    parser.add_argument("--prompt", default="A cat walking in a garden", help="Text prompt.")
    parser.add_argument("--negative-prompt", default="", help="Negative prompt.")
    parser.add_argument("--image", type=str, default=None, help="Input image path (for I2V, R2V, FLF2V, inpaint).")
    parser.add_argument("--last-image", type=str, default=None, help="Last frame image path (for FLF2V).")
    parser.add_argument("--video-dir", type=str, default=None, help="Directory of video frames (for inpaint).")
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    parser.add_argument("--guidance-scale", type=float, default=5.0, help="CFG guidance scale.")
    parser.add_argument("--height", type=int, default=480, help="Video height.")
    parser.add_argument("--width", type=int, default=832, help="Video width.")
    parser.add_argument("--num-frames", type=int, default=81, help="Number of frames.")
    parser.add_argument("--num-inference-steps", type=int, default=30, help="Sampling steps.")
    parser.add_argument("--flow-shift", type=float, default=5.0, help="Scheduler flow_shift.")
    parser.add_argument("--output", type=str, default="vace_output.mp4", help="Output video path.")
    parser.add_argument("--fps", type=int, default=16, help="Output video FPS.")
    parser.add_argument("--vae-use-tiling", action="store_true", default=True, help="Enable VAE tiling.")
    parser.add_argument("--enforce-eager", action="store_true", help="Disable torch.compile.")
    parser.add_argument("--ulysses-degree", type=int, default=1, help="Ulysses SP degree.")
    parser.add_argument("--ring-degree", type=int, default=1, help="Ring attention degree.")
    parser.add_argument("--cfg-parallel-size", type=int, default=1, choices=[1, 2], help="CFG parallel size.")
    return parser.parse_args()


def build_prompts(args):
    """Build prompt dict with multi_modal_data based on mode."""
    h, w, nf = args.height, args.width, args.num_frames

    gray = PIL.Image.new("RGB", (w, h), (128, 128, 128))
    mask_black = PIL.Image.new("L", (w, h), 0)
    mask_white = PIL.Image.new("L", (w, h), 255)

    prompt_data = {
        "prompt": args.prompt,
        "negative_prompt": args.negative_prompt,
    }

    if args.mode == "t2v":
        return prompt_data

    if args.mode == "r2v":
        assert args.image, "--image required for R2V mode"
        ref_img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        prompt_data["multi_modal_data"] = {"reference_images": [ref_img]}
        return prompt_data

    if args.mode == "i2v":
        assert args.image, "--image required for I2V mode"
        img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        prompt_data["multi_modal_data"] = {
            "video": [img] + [gray] * (nf - 1),
            "mask": [mask_black] + [mask_white] * (nf - 1),
        }
        return prompt_data

    if args.mode == "v2lf":
        assert args.image, "--image required for V2LF mode"
        img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        prompt_data["multi_modal_data"] = {
            "video": [gray] * (nf - 1) + [img],
            "mask": [mask_white] * (nf - 1) + [mask_black],
        }
        return prompt_data

    if args.mode == "flf2v":
        assert args.image and args.last_image, "--image and --last-image required for FLF2V"
        first = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        last = PIL.Image.open(args.last_image).convert("RGB").resize((w, h))
        prompt_data["multi_modal_data"] = {
            "video": [first] + [gray] * (nf - 2) + [last],
            "mask": [mask_black] + [mask_white] * (nf - 2) + [mask_black],
        }
        return prompt_data

    if args.mode == "inpaint":
        assert args.image, "--image required for inpaint mode"
        img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        d = 80
        frames, masks = [], []
        for _ in range(nf):
            base = np.array(img).copy()
            mask = PIL.Image.new("L", (w, h), 0)
            stripe = PIL.Image.new("L", (2 * d, h), 255)
            mask.paste(stripe, (w // 2 - d, 0))
            base[np.array(mask) > 128] = 128
            frames.append(PIL.Image.fromarray(base))
            masks.append(mask)
        prompt_data["multi_modal_data"] = {"video": frames, "mask": masks}
        return prompt_data

    raise ValueError(f"Unknown mode: {args.mode}")


def main():
    args = parse_args()
    generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)

    parallel_config = DiffusionParallelConfig(
        ulysses_degree=args.ulysses_degree,
        ring_degree=args.ring_degree,
        cfg_parallel_size=args.cfg_parallel_size,
    )

    omni = Omni(
        model=args.model,
        vae_use_tiling=args.vae_use_tiling,
        flow_shift=args.flow_shift,
        enforce_eager=args.enforce_eager,
        parallel_config=parallel_config,
    )

    prompt_data = build_prompts(args)

    print(f"\n{'=' * 60}")
    print(f"VACE {args.mode.upper()} Generation")
    print(f"  Model: {args.model}")
    print(f"  Size: {args.width}x{args.height}, {args.num_frames} frames, {args.num_inference_steps} steps")
    print(f"{'=' * 60}\n")

    start = time.perf_counter()
    outputs = omni.generate(
        prompt_data,
        OmniDiffusionSamplingParams(
            height=args.height,
            width=args.width,
            num_frames=args.num_frames,
            num_inference_steps=args.num_inference_steps,
            guidance_scale=args.guidance_scale,
            generator=generator,
        ),
    )
    elapsed = time.perf_counter() - start

    video = outputs[0].images
    if isinstance(video, list):
        video = video[0]
    if isinstance(video, torch.Tensor):
        video = video.cpu().numpy()
    if video.ndim == 5:
        video = video[0]
    print(f"Output shape: {video.shape}, Time: {elapsed:.1f}s")

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    from diffusers.utils import export_to_video

    if np.issubdtype(video.dtype, np.integer):
        video = video.astype(np.float32) / 255.0
    export_to_video(list(video), str(output_path), fps=args.fps)
    print(f"Saved to {output_path}")

    omni.close()


if __name__ == "__main__":
    main()