vllm_omni.transformers_utils.processors.audiox

Input transform utilities for the AudioX diffusion pipeline.

Loads and normalizes raw audio and video conditioning inputs (file path, URL, data: URI, np.ndarray, or torch.Tensor) into the tensors the pipeline needs: (channels, samples) for audio and [T, C, H, W] for video. This keeps the pipeline itself focused on the model forward pass and sampling logic.

TEXT_VIDEO_TASKS module-attribute

TEXT_VIDEO_TASKS = frozenset({'tv2a', 'tv2m'})

VIDEO_CONDITIONED_TASKS module-attribute

VIDEO_CONDITIONED_TASKS = (
    VIDEO_ONLY_TASKS | TEXT_VIDEO_TASKS
)

VIDEO_ONLY_TASKS module-attribute

VIDEO_ONLY_TASKS = frozenset({'v2a', 'v2m'})
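The three task sets compose by plain set union, so a membership test is enough to decide whether a task needs a video reference. A minimal sketch follows: the set values are copied from the attributes above, while `needs_video` and the task name `"t2a"` are hypothetical and used only for illustration.

```python
# Values copied from the documented module attributes.
VIDEO_ONLY_TASKS = frozenset({"v2a", "v2m"})
TEXT_VIDEO_TASKS = frozenset({"tv2a", "tv2m"})
VIDEO_CONDITIONED_TASKS = VIDEO_ONLY_TASKS | TEXT_VIDEO_TASKS


def needs_video(task: str) -> bool:
    """Hypothetical helper: True when the task consumes a video reference."""
    return task in VIDEO_CONDITIONED_TASKS


print(needs_video("tv2a"))  # a text+video task
print(needs_video("t2a"))   # "t2a" is an assumed task name, not in any set above
```

Because the sets are frozensets, they are immutable and hashable, so callers cannot accidentally extend them at runtime.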

adjust_video_duration

adjust_video_duration(
    frames: Tensor, duration: float, target_fps: int
) -> Tensor

load_video_source

load_video_source(
    source: Any,
    *,
    target_fps: int,
    duration: float,
    seek_time: float = 0.0,
) -> Tensor

materialize_media_source

materialize_media_source(source: str) -> str

Return a local filesystem path for source.

Accepts a local path, a data:<mime>;base64,... URI, or an http(s):// URL. A non-local source is fetched (or, for a data: URI, decoded) into a NamedTemporaryFile, and the path of that file is returned. Callers do not need to clean up the temporary file (the OS does so on exit).
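The dispatch described above can be sketched with the standard library alone. This is an illustrative stand-in, not the library's implementation: the name `materialize_sketch`, the 30-second timeout, and the error raised for non-base64 data URIs are all assumptions.

```python
import base64
import tempfile
import urllib.request


def materialize_sketch(source: str) -> str:
    """Illustrative stand-in for materialize_media_source (assumed behavior)."""
    if source.startswith("data:"):
        # Split "data:<mime>;base64" from the encoded payload.
        header, _, payload = source.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("this sketch only handles base64 data URIs")
        data = base64.b64decode(payload)
    elif source.startswith(("http://", "https://")):
        # Timeout value is an assumption made for this sketch.
        with urllib.request.urlopen(source, timeout=30) as resp:
            data = resp.read()
    else:
        # Anything else is treated as a local path and returned unchanged.
        return source

    # delete=False keeps the file on disk after the handle is closed,
    # so the returned path stays valid for the caller.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

Returning a path rather than bytes lets downstream decoders that expect a filename work the same way for all three source kinds.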

normalize_prompts

normalize_prompts(
    prompts: list[Any],
) -> list[dict[str, Any]]

Coerce raw prompt entries into {"prompt": str, ...} dicts, preserving any extra keys.
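One plausible reading of this coercion is sketched below. The name `normalize_prompts_sketch` is hypothetical, and the handling of a missing "prompt" key (empty string) and of unsupported entry types (TypeError) are assumptions, since the reference does not specify them.

```python
from typing import Any


def normalize_prompts_sketch(prompts: list[Any]) -> list[dict[str, Any]]:
    """Illustrative stand-in for normalize_prompts (assumed behavior)."""
    normalized: list[dict[str, Any]] = []
    for entry in prompts:
        if isinstance(entry, str):
            # A bare string becomes a dict with only the "prompt" key.
            normalized.append({"prompt": entry})
        elif isinstance(entry, dict):
            # Shallow copy so extra keys survive and the input is not mutated.
            item = dict(entry)
            # Assumption: a missing prompt falls back to an empty string.
            item["prompt"] = str(item.get("prompt", ""))
            normalized.append(item)
        else:
            # Assumption: other entry types are rejected.
            raise TypeError(f"unsupported prompt entry: {type(entry).__name__}")
    return normalized
```

With this sketch, `["rain on a roof", {"prompt": "jazz", "video": "clip.mp4"}]` yields two dicts, the second still carrying its `video` key.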

normalize_video_tensor

normalize_video_tensor(
    frames: Tensor, size: int = 224
) -> Tensor

prepare_audio_reference

prepare_audio_reference(
    source: Any,
    *,
    model_sample_rate: int,
    seconds_start: float,
    seconds_total: float,
    device: device,
) -> Tensor

Decode an audio source into a stereo (2, samples) tensor at the model's rate.

prepare_video_reference

prepare_video_reference(
    source: Any,
    *,
    duration: float,
    target_fps: int,
    seek_time: float = 0.0,
) -> Tensor

Decode a video clip (or single image) into the AudioX [T, 3, 224, 224] form.
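As a rough guide to the output shape, the helper below computes the [T, 3, 224, 224] dimensions for a given clip. It assumes the frame count is T = duration × target_fps, which the reference does not state explicitly; `expected_video_shape` is a hypothetical name, and the default size of 224 is taken from the `normalize_video_tensor` signature.

```python
def expected_video_shape(
    duration: float, target_fps: int, size: int = 224
) -> tuple[int, int, int, int]:
    """Hypothetical helper: expected (T, C, H, W) under the stated assumption."""
    # Assumption: one frame per 1/target_fps seconds over the whole duration.
    num_frames = round(duration * target_fps)
    return (num_frames, 3, size, size)


print(expected_video_shape(10.0, 5))
```

A check like this can be useful in tests that compare a decoded tensor's shape against what the caller requested.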