VACE Video Generation
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/vace.
Generate videos from text prompts, images, or video conditions using vLLM-Omni's VACE diffusion pipeline.
vace_video_generation.py: command-line script for multi-mode video generation with advanced options.
Overview
VACE (Video All-in-one Creation Engine) supports multiple video tasks through a single unified model, including text-to-video, image-to-video, first-last-frame interpolation, inpainting, and reference image-guided generation.
Supported Models
| Model | Architecture | Peak VRAM (GiB) * | Model Weights (GiB) | HuggingFace |
|---|---|---|---|---|
| Wan2.1-VACE (1.3B) | Wan2.1 | TBD | ~10 | Wan-AI/Wan2.1-VACE-1.3B-diffusers |
| Wan2.1-VACE (14B) | Wan2.1 | TBD | ~38 | Wan-AI/Wan2.1-VACE-14B-diffusers |
Info
*Peak VRAM: based on basic single-card usage, 480x832 resolution, 81 frames, 30 inference steps, without any acceleration/optimization features.
Default model: Wan-AI/Wan2.1-VACE-14B-diffusers
Quick Start
Python API
Text-to-video generation:
import torch

from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

if __name__ == "__main__":
    omni = Omni(model="Wan-AI/Wan2.1-VACE-1.3B-diffusers")
    prompt = "A sleek robot stands in a vast warehouse filled with boxes"
    outputs = omni.generate(
        prompt,
        OmniDiffusionSamplingParams(
            height=480,
            width=832,
            num_frames=81,
            num_inference_steps=30,
            guidance_scale=5.0,
        ),
    )
    video = outputs[0].images
    from diffusers.utils import export_to_video
    export_to_video(list(video[0]), "t2v_output.mp4", fps=16)
    omni.close()
Local CLI Usage
python vace_video_generation.py \
--mode t2v \
--prompt "A sleek robot stands in a vast warehouse filled with boxes" \
--height 480 --width 832 --num-frames 81 \
--output t2v_output.mp4
Key Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --mode | str | "t2v" | VACE task mode: t2v, i2v, v2lf, flf2v, inpaint, r2v |
| --model | str | "Wan-AI/Wan2.1-VACE-14B-diffusers" | Model ID or local path |
| --prompt | str | "A cat walking in a garden" | Text description of desired video |
| --negative-prompt | str | "" | Negative prompt for classifier-free guidance |
| --image | str | None | Input image path (required for I2V, V2LF, FLF2V, inpaint, and R2V modes) |
| --last-image | str | None | Last frame image path (for FLF2V mode) |
| --height | int | 480 | Output video height in pixels (should be a multiple of 16) |
| --width | int | 832 | Output video width in pixels (should be a multiple of 16) |
| --num-frames | int | 81 | Number of video frames to generate |
| --num-inference-steps | int | 30 | Number of denoising steps (more steps = higher quality, slower) |
| --guidance-scale | float | 5.0 | Classifier-free guidance scale |
| --flow-shift | float | 5.0 | Scheduler flow shift parameter |
| --seed | int | 42 | Random seed for deterministic sampling |
| --fps | int | 16 | Frames per second for the saved MP4 |
| --output | str | "vace_output.mp4" | Path to save the generated video |
| --vae-use-tiling | flag | on | Enable VAE tiling for memory optimization |
| --cfg-parallel-size | int | 1 | Set to 2 to enable CFG Parallel |
| --ulysses-degree | int | 1 | Ulysses sequence parallel degree for multi-GPU inference |
| --ring-degree | int | 1 | Ring sequence parallel degree for hybrid Ulysses + Ring inference |
| --tensor-parallel-size | int | — | Tensor parallel size |
| --enforce-eager | flag | off | Disable torch.compile |
If you encounter OOM errors, try --vae-use-tiling or multi-GPU parallelism options (--ulysses-degree, --cfg-parallel-size).
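The table above notes that height and width should be multiples of 16, and that the clip length follows from --num-frames and --fps. A minimal sketch of a pre-flight check, assuming you want to snap arbitrary sizes down to the nearest valid value before calling the script. `snap_to_multiple` and `clip_seconds` are hypothetical helpers for illustration, not part of vLLM-Omni:

```python
def snap_to_multiple(value: int, base: int = 16) -> int:
    """Round value down to the nearest positive multiple of base."""
    if value < base:
        raise ValueError(f"{value} is smaller than the minimum size {base}")
    return (value // base) * base


def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback length of the saved video in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    # A 500-pixel height is not a multiple of 16, so it is snapped down.
    print(snap_to_multiple(500), snap_to_multiple(832))  # 496 832
    # The default 81 frames at 16 fps give a clip of roughly five seconds.
    print(clip_seconds(81, 16))  # 5.0625
```

Rounding down rather than up is a choice made here so the request never grows beyond what was asked for; rounding up would work equally well if memory allows.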
More CLI Examples
Image-to-Video (I2V)
First frame is kept, remaining frames are generated:
python vace_video_generation.py \
--mode i2v \
--image astronaut.jpg \
--prompt "An astronaut emerging from a cracked egg on the moon" \
--height 480 --width 832 --num-frames 81 \
--output i2v_output.mp4
First-Last-Frame Interpolation (FLF2V)
python vace_video_generation.py \
--mode flf2v \
--image first_frame.jpg --last-image last_frame.jpg \
--prompt "A bird takes off from a branch and lands on another" \
--height 512 --width 512 --num-frames 81 \
--output flf2v_output.mp4
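The I2V and FLF2V modes differ only in which frames are supplied as conditioning and which are left for the model to generate. The sketch below mirrors the per-frame mask layout that the example script builds for the i2v, v2lf, and flf2v modes, using plain integers instead of PIL images. The reading of 0 as "keep" and 255 as "generate" is inferred from the script and the note that the first frame is kept in I2V; `mask_layout` is a hypothetical helper, not a vLLM-Omni API:

```python
KEEP, GENERATE = 0, 255  # assumed meaning: black mask keeps a frame, white regenerates it


def mask_layout(mode: str, num_frames: int) -> list[int]:
    """Return one mask value per frame for a frame-conditioned VACE mode."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    if mode == "i2v":
        # First frame is supplied, everything after it is generated.
        return [KEEP] + [GENERATE] * (num_frames - 1)
    if mode == "v2lf":
        # Only the last frame is supplied.
        return [GENERATE] * (num_frames - 1) + [KEEP]
    if mode == "flf2v":
        # First and last frames are supplied, the middle is interpolated.
        return [KEEP] + [GENERATE] * (num_frames - 2) + [KEEP]
    raise ValueError(f"Unsupported mode for this sketch: {mode}")


if __name__ == "__main__":
    print(mask_layout("i2v", 5))    # [0, 255, 255, 255, 255]
    print(mask_layout("flf2v", 5))  # [0, 255, 255, 255, 0]
```

In the actual script each entry is a full-frame PIL mask image, and generated frames are filled with a gray placeholder in the conditioning video.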
Inpainting
Center vertical stripe is masked and regenerated:
python vace_video_generation.py \
--mode inpaint \
--image scene.jpg \
--prompt "Shrek walks out of a building" \
--height 480 --width 832 --num-frames 81 \
--output inpaint_output.mp4
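To make the "center vertical stripe" concrete, the sketch below computes the column range the example script masks, assuming the script's hard-coded half-width of 80 pixels. `stripe_bounds` and `masked_fraction` are hypothetical helpers written for this illustration:

```python
def stripe_bounds(width: int, half_width: int = 80) -> tuple[int, int]:
    """Return the [left, right) column range of the centered mask stripe."""
    if width < 2 * half_width:
        raise ValueError("frame is narrower than the stripe")
    left = width // 2 - half_width
    return left, left + 2 * half_width


def masked_fraction(width: int, half_width: int = 80) -> float:
    """Fraction of each frame's width that is regenerated."""
    left, right = stripe_bounds(width, half_width)
    return (right - left) / width


if __name__ == "__main__":
    # At the default 832-pixel width, columns 336 to 495 are masked.
    print(stripe_bounds(832))  # (336, 496)
    print(round(masked_fraction(832), 3))  # 0.192
```

The stripe has a fixed pixel width, so at narrower resolutions it covers a larger share of each frame.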
Reference Image-guided (R2V)
python vace_video_generation.py \
--mode r2v \
--image reference.jpg \
--prompt "Camera slowly zooms out from the character" \
--height 480 --width 832 --num-frames 81 \
--output r2v_output.mp4
Example materials
vace_video_generation.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""VACE video generation example.

VACE (Video All-in-one Creation Engine) supports multiple video tasks:
- T2V: Text-to-Video
- I2V: Image-to-Video (first frame conditioning)
- V2LF: Video-to-Last-Frame
- FLF2V: First-Last-Frame interpolation
- Inpainting: Masked region generation
- R2V: Reference image-guided generation

Usage examples:
    # T2V (text-to-video)
    python vace_video_generation.py --mode t2v --prompt "A robot in a warehouse"

    # I2V (image-to-video, first frame kept)
    python vace_video_generation.py --mode i2v --image input.jpg --prompt "..."

    # FLF2V (first-last frame interpolation)
    python vace_video_generation.py --mode flf2v --image first.jpg --last-image last.jpg

    # R2V (reference image guided)
    python vace_video_generation.py --mode r2v --image ref.jpg --prompt "..."
"""

import argparse
import time
from pathlib import Path

import numpy as np
import PIL.Image
import torch

from vllm_omni.diffusion.data import DiffusionParallelConfig
from vllm_omni.entrypoints.omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
from vllm_omni.platforms import current_omni_platform


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="VACE video generation.")
    parser.add_argument(
        "--model",
        default="Wan-AI/Wan2.1-VACE-14B-diffusers",
        help="VACE model ID or local path.",
    )
    parser.add_argument(
        "--mode",
        default="t2v",
        choices=["t2v", "i2v", "v2lf", "flf2v", "inpaint", "r2v"],
        help="Generation mode.",
    )
    parser.add_argument("--prompt", default="A cat walking in a garden", help="Text prompt.")
    parser.add_argument("--negative-prompt", default="", help="Negative prompt.")
    parser.add_argument("--image", type=str, default=None, help="Input image path (for I2V, R2V, FLF2V, inpaint).")
    parser.add_argument("--last-image", type=str, default=None, help="Last frame image path (for FLF2V).")
    parser.add_argument("--video-dir", type=str, default=None, help="Directory of video frames (for inpaint).")
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    parser.add_argument("--guidance-scale", type=float, default=5.0, help="CFG guidance scale.")
    parser.add_argument("--height", type=int, default=480, help="Video height.")
    parser.add_argument("--width", type=int, default=832, help="Video width.")
    parser.add_argument("--num-frames", type=int, default=81, help="Number of frames.")
    parser.add_argument("--num-inference-steps", type=int, default=30, help="Sampling steps.")
    parser.add_argument("--flow-shift", type=float, default=5.0, help="Scheduler flow_shift.")
    parser.add_argument("--output", type=str, default="vace_output.mp4", help="Output video path.")
    parser.add_argument("--fps", type=int, default=16, help="Output video FPS.")
    parser.add_argument("--vae-use-tiling", action="store_true", default=True, help="Enable VAE tiling.")
    parser.add_argument("--enforce-eager", action="store_true", help="Disable torch.compile.")
    parser.add_argument("--ulysses-degree", type=int, default=1, help="Ulysses SP degree.")
    parser.add_argument("--ring-degree", type=int, default=1, help="Ring attention degree.")
    parser.add_argument("--cfg-parallel-size", type=int, default=1, choices=[1, 2], help="CFG parallel size.")
    return parser.parse_args()


def build_prompts(args):
    """Build prompt dict with multi_modal_data based on mode."""
    h, w, nf = args.height, args.width, args.num_frames
    # Gray frames are placeholders for content the model should generate.
    gray = PIL.Image.new("RGB", (w, h), (128, 128, 128))
    # Mask convention used below: black (0) keeps the frame, white (255) regenerates it.
    mask_black = PIL.Image.new("L", (w, h), 0)
    mask_white = PIL.Image.new("L", (w, h), 255)

    prompt_data = {
        "prompt": args.prompt,
        "negative_prompt": args.negative_prompt,
    }

    if args.mode == "t2v":
        return prompt_data

    if args.mode == "r2v":
        assert args.image, "--image required for R2V mode"
        ref_img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        prompt_data["multi_modal_data"] = {"reference_images": [ref_img]}
        return prompt_data

    if args.mode == "i2v":
        assert args.image, "--image required for I2V mode"
        img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        # Keep the first frame, generate the remaining nf - 1 frames.
        prompt_data["multi_modal_data"] = {
            "video": [img] + [gray] * (nf - 1),
            "mask": [mask_black] + [mask_white] * (nf - 1),
        }
        return prompt_data

    if args.mode == "v2lf":
        assert args.image, "--image required for V2LF mode"
        img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        # Keep the last frame, generate everything before it.
        prompt_data["multi_modal_data"] = {
            "video": [gray] * (nf - 1) + [img],
            "mask": [mask_white] * (nf - 1) + [mask_black],
        }
        return prompt_data

    if args.mode == "flf2v":
        assert args.image and args.last_image, "--image and --last-image required for FLF2V"
        first = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        last = PIL.Image.open(args.last_image).convert("RGB").resize((w, h))
        # Keep the first and last frames, generate the nf - 2 frames in between.
        prompt_data["multi_modal_data"] = {
            "video": [first] + [gray] * (nf - 2) + [last],
            "mask": [mask_black] + [mask_white] * (nf - 2) + [mask_black],
        }
        return prompt_data

    if args.mode == "inpaint":
        assert args.image, "--image required for inpaint mode"
        img = PIL.Image.open(args.image).convert("RGB").resize((w, h))
        d = 80  # half-width of the masked stripe, in pixels
        frames, masks = [], []
        for _ in range(nf):
            base = np.array(img).copy()
            # White vertical stripe, 2 * d pixels wide, centered horizontally.
            mask = PIL.Image.new("L", (w, h), 0)
            stripe = PIL.Image.new("L", (2 * d, h), 255)
            mask.paste(stripe, (w // 2 - d, 0))
            # Gray out the masked pixels in the conditioning frame.
            base[np.array(mask) > 128] = 128
            frames.append(PIL.Image.fromarray(base))
            masks.append(mask)
        prompt_data["multi_modal_data"] = {"video": frames, "mask": masks}
        return prompt_data

    raise ValueError(f"Unknown mode: {args.mode}")


def main():
    args = parse_args()
    generator = torch.Generator(device=current_omni_platform.device_type).manual_seed(args.seed)

    parallel_config = DiffusionParallelConfig(
        ulysses_degree=args.ulysses_degree,
        ring_degree=args.ring_degree,
        cfg_parallel_size=args.cfg_parallel_size,
    )

    omni = Omni(
        model=args.model,
        vae_use_tiling=args.vae_use_tiling,
        flow_shift=args.flow_shift,
        enforce_eager=args.enforce_eager,
        parallel_config=parallel_config,
    )

    prompt_data = build_prompts(args)

    print(f"\n{'=' * 60}")
    print(f"VACE {args.mode.upper()} Generation")
    print(f" Model: {args.model}")
    print(f" Size: {args.width}x{args.height}, {args.num_frames} frames, {args.num_inference_steps} steps")
    print(f"{'=' * 60}\n")

    start = time.perf_counter()
    outputs = omni.generate(
        prompt_data,
        OmniDiffusionSamplingParams(
            height=args.height,
            width=args.width,
            num_frames=args.num_frames,
            num_inference_steps=args.num_inference_steps,
            guidance_scale=args.guidance_scale,
            generator=generator,
        ),
    )
    elapsed = time.perf_counter() - start

    video = outputs[0].images
    if isinstance(video, list):
        video = video[0]
    if isinstance(video, torch.Tensor):
        video = video.cpu().numpy()
    if video.ndim == 5:
        video = video[0]
    print(f"Output shape: {video.shape}, Time: {elapsed:.1f}s")

    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    from diffusers.utils import export_to_video

    if np.issubdtype(video.dtype, np.integer):
        video = video.astype(np.float32) / 255.0
    export_to_video(list(video), str(output_path), fps=args.fps)
    print(f"Saved to {output_path}")
    omni.close()


if __name__ == "__main__":
    main()