X-To-Video-Audio
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/x_to_video_audio.
The DreamID-Omni pipeline generates short videos with audio from a text prompt plus reference images and reference audio clips.
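Local CLI Usage
Download the Model locally
DreamID-Omni combines several models and ships without a ready-made config, so the weights must be downloaded locally. The download_dreamid_omni.py script listed under Example materials fetches the weights and generates the config files.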
After the download, the model directory will look like this:
dreamid_omni/
├── DreamID-Omni/
│   └── dreamid_omni.safetensors
├── MMAudio/
│   └── ext_weights/
│       ├── best_netG.pt
│       └── v1-16.pth
├── Wan2.2-TI2V-5B/
│   ├── google/*
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── Wan2.2_VAE.pth
├── model_index.json    # created by download_dreamid_omni.py
└── transformer/
    └── config.json     # created by download_dreamid_omni.py
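A quick way to confirm that the download completed is to check the directory for the entries listed in the tree above. The following is a minimal sketch and not part of the example scripts: `find_missing`, `EXPECTED_FILES`, and `EXPECTED_DIRS` are hypothetical names, and only the presence of the `google/` directory is checked, not its contents.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "DreamID-Omni/dreamid_omni.safetensors",
    "MMAudio/ext_weights/best_netG.pt",
    "MMAudio/ext_weights/v1-16.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "model_index.json",
    "transformer/config.json",
]
EXPECTED_DIRS = ["Wan2.2-TI2V-5B/google"]


def find_missing(model_dir: str) -> list[str]:
    """Return the expected entries that are absent under model_dir."""
    root = Path(model_dir)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    # Directories are reported with a trailing slash to tell them apart.
    missing += [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing
```

Calling `find_missing("./dreamid_omni")` returns an empty list when every expected entry is in place; any path it returns points at a download that should be repeated.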
Run the Inference
python x_to_video_audio.py \
--model /path/to/dreamid_omni \
--prompt "Two people walking together and singing happily" \
--image-path ./example0.png ./example1.png \
--audio-path ./example0.wav ./example1.wav \
--video-negative-prompt "jitter, bad hands, blur, distortion" \
--audio-negative-prompt "robotic, muffled, echo, distorted" \
--cfg-parallel-size 2 \
--num-inference-steps 45 \
--height 704 \
--width 1280 \
--output out_dreamid_omni_twoip.mp4
To shard the weights with HSDP, add --use-hsdp and --hsdp-shard-size to the same command:
python x_to_video_audio.py \
--model /path/to/dreamid_omni \
--prompt "Two people walking together and singing happily" \
--image-path ./example0.png ./example1.png \
--audio-path ./example0.wav ./example1.wav \
--video-negative-prompt "jitter, bad hands, blur, distortion" \
--audio-negative-prompt "robotic, muffled, echo, distorted" \
--cfg-parallel-size 2 \
--num-inference-steps 45 \
--height 704 \
--width 1280 \
--use-hsdp \
--hsdp-shard-size 2 \
--output out_dreamid_omni_twoip.mp4
Reference images and audio clips are available in the test cases of the official repo: https://github.com/Guoxu1233/DreamID-Omni
For example, the single-IP (oneip) reference files are under https://github.com/Guoxu1233/DreamID-Omni/tree/main/test_case/oneip. Download them to your local machine to use them for testing.
# Example usage for oneip, ref media from the official repo DreamID-Omni
python x_to_video_audio.py \
--model /path/to/dreamid_omni \
--prompt "<img1>: In the frame, a woman with black long hair is identified as <sub1>.\n**Overall Environment/Scene**: A lively open-kitchen café at night; stove flames flare, steam rises, and warm pendant lights swing slightly as staff move behind her. The shot is an upper-body close-up.\n**Main Characters/Subjects Appearance**: <sub1> is a young woman with thick dark wavy hair and a side part. She wears a fitted black top under a light apron, a thin gold chain necklace, and small stud earrings.\n**Main Characters/Subjects Actions**: <sub1> tastes the sauce with a spoon, then turns her face toward the camera while still holding the spoon, her expression shifting from focused to conflicted.\n<sub1> maintains eye contact, swallows as if choosing her words, and says, <S>I keep telling myself I’m fine,but some nights it feels like I’m just performing calm.<E>" \
--image-path 9.png \
--audio-path 9.wav \
--video-negative-prompt "jitter, bad hands, blur, distortion" \
--audio-negative-prompt "robotic, muffled, echo, distorted" \
--cfg-parallel-size 2 \
--num-inference-steps 45 \
--height 704 \
--width 1280 \
--output out_dreamid_omni_oneip.mp4
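The oneip prompt above follows a visible layout: an `<img1>` line that binds the reference image to `<sub1>`, three labelled sections, and a final line whose spoken text sits between `<S>` and `<E>`. The sketch below assembles a prompt in that layout. It is inferred from this single example only, so the meaning of the tags and the required section order are assumptions, and `build_prompt` is a hypothetical helper, not part of the example scripts.

```python
def build_prompt(
    subject_intro: str,
    scene: str,
    appearance: str,
    actions: str,
    speech_lead: str,
    speech: str,
    sep: str = "\\n",
) -> str:
    """Assemble a prompt in the layout of the oneip example.

    sep defaults to a literal backslash followed by "n", which is what the
    shell passes when "\n" appears inside double quotes as in the command
    above. Whether the script expects that or a real newline is an assumption.
    """
    parts = [
        f"<img1>: {subject_intro}",
        f"**Overall Environment/Scene**: {scene}",
        f"**Main Characters/Subjects Appearance**: {appearance}",
        f"**Main Characters/Subjects Actions**: {actions}",
        # The spoken line is wrapped in <S>...<E>, as in the example prompt.
        f"{speech_lead} <S>{speech}<E>",
    ]
    return sep.join(parts)


prompt = build_prompt(
    subject_intro="In the frame, a woman with black long hair is identified as <sub1>.",
    scene="A lively open-kitchen café at night.",
    appearance="<sub1> wears a fitted black top under a light apron.",
    actions="<sub1> tastes the sauce with a spoon, then turns toward the camera.",
    speech_lead="<sub1> maintains eye contact and says,",
    speech="I keep telling myself I'm fine.",
)
print(prompt)
```

Keeping the sections as separate arguments makes it easier to vary the scene or the spoken line while leaving the tag structure untouched.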
Key arguments:
- --prompt: text description (string).
- --model: path to the local model directory.
- --height/--width: output resolution (defaults to 704 × 1024).
- --image-path: paths to the input reference images (one or more).
- --audio-path: paths to the input reference audio clips (one or more), which set the voice timbre of the output video.
- --cfg-parallel-size: number of devices used for CFG parallelism (defaults to 1).
- --use-hsdp: enable HSDP weight sharding for the DreamID-Omni fused blocks.
- --hsdp-shard-size: number of GPUs used for HSDP sharding.
- --hsdp-replicate-size: number of HSDP replica groups.
- --num-inference-steps: number of denoising steps (defaults to 45).
- --video-negative-prompt: negative prompt for video generation.
- --audio-negative-prompt: negative prompt for audio generation.
- --enable-cpu-offload: enable CPU offload (defaults to False).
- --cache-backend: enable cache_dit for acceleration.
- --quantization: online (dynamic) quantization method, fp8 or int8. VAEs, the T5 text encoder, norms, and modulation stay in bf16.
Example materials
download_dreamid_omni.py
import argparse
import fcntl
import os
import site
import subprocess
import tempfile
import time
from pathlib import Path

from huggingface_hub import snapshot_download

DEPENDENCY_REPO = "https://github.com/bytedance/DreamID-V.git"
DEPENDENCY_BRANCH = "omni"

CACHE_DIR = Path(tempfile.gettempdir()) / "vllm-omni-dependency"
LOCK_FILE = CACHE_DIR / ".install.lock"
DEPENDENCY_DIR = CACHE_DIR / "DreamID-Omni"


def download_dependency():
    """Clone the DreamID-Omni code once and add it to sys.path via a .pth file."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

    # An exclusive file lock keeps concurrent processes from cloning twice.
    with open(LOCK_FILE, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        if not DEPENDENCY_DIR.exists():
            print(f"Downloading DreamID-Omni to {DEPENDENCY_DIR} ...")
            subprocess.run(
                ["git", "clone", "--depth", "1", DEPENDENCY_REPO, "--branch", DEPENDENCY_BRANCH, str(DEPENDENCY_DIR)],
                check=True,
            )
            print("Download finished.")
        fcntl.flock(f, fcntl.LOCK_UN)

    # write .pth to site-packages
    site_packages = Path(site.getsitepackages()[0])
    pth_file = site_packages / "vllm_omni_dependency.pth"
    pth_file.write_text(str(DEPENDENCY_DIR))
    print(f"Added {DEPENDENCY_DIR} to site-packages via {pth_file}")


def timed_download(repo_id: str, local_dir: str, allow_patterns: list | None = None):
    """Download files from HF repo and log time + destination."""
    if os.path.exists(local_dir):
        print(f"Directory {local_dir} already exists. Skipping download.")
        return

    print(f"Starting download from {repo_id} into {local_dir}")
    start_time = time.time()

    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        local_dir_use_symlinks=False,
        allow_patterns=allow_patterns,
    )

    elapsed = time.time() - start_time
    print(f"✅ Finished downloading {repo_id} in {elapsed:.2f} seconds. Files saved at: {local_dir}")


def main(output_dir: str):
    # Wan2.2
    wan_dir = os.path.join(output_dir, "Wan2.2-TI2V-5B")
    timed_download(
        repo_id="Wan-AI/Wan2.2-TI2V-5B",
        local_dir=wan_dir,
        allow_patterns=["google/*", "models_t5_umt5-xxl-enc-bf16.pth", "Wan2.2_VAE.pth"],
    )

    # MMAudio
    mm_audio_dir = os.path.join(output_dir, "MMAudio")
    timed_download(
        repo_id="hkchengrex/MMAudio",
        local_dir=mm_audio_dir,
        allow_patterns=["ext_weights/best_netG.pt", "ext_weights/v1-16.pth"],
    )

    # DreamID-Omni
    dreamid_dir = os.path.join(output_dir, "DreamID-Omni")
    timed_download(repo_id="XuGuo699/DreamID-Omni", local_dir=dreamid_dir)

    # Now we construct the config file
    import json

    data = {
        "_class_name": "DreamIDOmniPipeline",
    }
    with open(os.path.join(output_dir, "model_index.json"), "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    print(f"model_index.json created at {os.path.join(output_dir, 'model_index.json')}")

    transformer_dir = os.path.join(output_dir, "transformer")
    os.makedirs(transformer_dir, exist_ok=True)
    with open(os.path.join(transformer_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"fusion": "DreamID-Omni/dreamid_omni.safetensors"}, f)
    print(f"transformer/config.json created at {os.path.join(transformer_dir, 'config.json')}")

    # now we download the dependency code
    download_dependency()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download models from Hugging Face")
    parser.add_argument(
        "--output-dir", type=str, default="./dreamid_omni", help="Base directory to save downloaded models"
    )
    args = parser.parse_args()
    main(args.output_dir)

x_to_video_audio.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/x_to_video_audio/x_to_video_audio.py.
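The download script makes the cloned DreamID-Omni code importable by writing a .pth file into site-packages: when Python processes a site directory, each line of a .pth file that names an existing directory is appended to sys.path. The sketch below reproduces that mechanism in a throwaway directory using `site.addsitedir`, so it does not touch the real site-packages. `demo_pth` and the directory names are hypothetical and only illustrate the behaviour.

```python
import os
import site
import sys
import tempfile


def demo_pth() -> bool:
    """Return True if a .pth file caused its target directory to join sys.path."""
    with tempfile.TemporaryDirectory() as tmp:
        fake_site = os.path.join(tmp, "fake-site-packages")
        dependency = os.path.join(tmp, "dependency")
        os.makedirs(fake_site)
        os.makedirs(dependency)

        # One directory per line; lines naming missing directories are ignored.
        pth_path = os.path.join(fake_site, "demo_dependency.pth")
        with open(pth_path, "w", encoding="utf-8") as f:
            f.write(dependency + "\n")

        before = list(sys.path)
        # addsitedir appends fake_site itself and processes its .pth files.
        site.addsitedir(fake_site)
        added = os.path.abspath(dependency) in sys.path

        # Restore sys.path so the demo leaves no trace.
        sys.path[:] = before
        return added


print("dependency directory added via .pth:", demo_pth())
```

In the real script the .pth file lives in the interpreter's own site-packages, so the path is added automatically at every interpreter start until the file is removed.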