
vllm_omni.diffusion.model_loader.hub_prefetch

Best-effort HuggingFace Hub prefetch for multi-subfolder pipelines.

This module exists to defend diffusion pipelines against a race condition we hit after the transformers v5 rebase (see the Qwen-Image-Edit-2509 failure in Buildkite vllm-omni-rebase build 1043):

When several diffusion worker processes start in parallel and each calls SomeModel.from_pretrained(model_id, subfolder="text_encoder", ...) with a cold HuggingFace cache, transformers v5's cache-resolution (cached_files) can observe a partially-written shard set written by a peer worker and raise OSError: <model_id> does not appear to have a file named text_encoder/model-00002-of-00002.safetensors even though the peer will eventually finish writing it.

Why origin/main does not need this helper

The exact same __init__ code lives on origin/main (e.g. the Qwen-Image pipeline_qwen_image_edit_plus.py in build vllm-omni#7412 passes without any prefetch), so the race is NOT a behavioural change in vLLM-Omni itself. Two environmental factors mask the race on main:

  • origin/main is pinned (transitively, via vLLM main) to transformers 4.x. In 4.x the per-file cached_file path resolves shards lazily, one at a time, so each hf_hub_download blocks on its own single-file .lock and the second worker naturally waits for the first worker's atomic rename. transformers>=5.0 rewrote this into cached_files (plural), which batch-resolves every shard listed in the index up-front via os.path.isfile and raises immediately if any shard is still sitting under its *.incomplete name. This came in the same wave of v5 changes that introduced tie_weights(missing_keys=..., recompute_mapping=...) (see the Dynin shim in dynin_omni_token2text.py).
  • CI shares HF_HOME=/fsx/hf_cache across pipelines (both the vllm-omni and vllm-omni-rebase pipelines mount the same FS). That cache is normally warm for long-lived repos like Qwen-Image-Edit-2509, so most builds never go through the download path at all. Build 1043 happened to hit a partially-evicted cache AND transformers v5's stricter resolver simultaneously, which is why the failure looks 'rebase-specific' but is really a latent race that main was absorbing via the first factor above (lazy per-file resolution in transformers 4.x).
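The difference between the two resolvers can be illustrated with a toy model. This is a minimal sketch, not transformers code: resolve_eagerly is a hypothetical stand-in for the batch check described above, and the flat directory layout is a simplification of the real HF cache.

```python
import os
import tempfile


def resolve_eagerly(snapshot_dir, shard_names):
    """Toy batch resolver: every shard must already exist under its final name."""
    paths = [os.path.join(snapshot_dir, name) for name in shard_names]
    for path in paths:
        # A shard still stored as "<name>.incomplete" fails this check.
        if not os.path.isfile(path):
            raise OSError(
                f"{snapshot_dir} does not appear to have a file named "
                f"{os.path.basename(path)}"
            )
    return paths


def demo():
    shards = [
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
    ]
    with tempfile.TemporaryDirectory() as snap:
        # Shard 1 is finished; a peer worker is still writing shard 2.
        open(os.path.join(snap, shards[0]), "wb").close()
        pending = os.path.join(snap, shards[1] + ".incomplete")
        open(pending, "wb").close()
        try:
            resolve_eagerly(snap, shards)
            raced = False
        except OSError:
            raced = True
        # The peer finishes and renames the shard into place.
        os.rename(pending, os.path.join(snap, shards[1]))
        resolved = resolve_eagerly(snap, shards)
    return raced, len(resolved)
```

The same call fails while the peer is mid-write and succeeds once the rename has happened, which is why waiting for a complete cache is enough to avoid the error.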

huggingface_hub.snapshot_download does take per-blob .lock files, but those locks are only acquired once each blob is mid-write - they do not cover the surrounding cached_files shard-list resolution that transformers v5 performs eagerly. So we additionally wrap the snapshot_download call in our own node-wide fcntl.flock keyed on the repo id (see _repo_prefetch_lock). The first concurrent worker / process to reach the helper fully materialises the snapshot; subsequent entrants block on the flock and then find a warm cache, so their from_pretrained calls never observe a half-written shard set. For a warm cache the snapshot call is a near-noop (it only stat()s the files), so this is also cheap on origin/main should we ever backport it there.
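The locking pattern described above might look roughly like the following. This is a sketch under assumptions: repo_lock and prefetch_once are hypothetical names (the module's own helper is _repo_prefetch_lock, whose implementation is not shown here), the lock-file naming is invented, and download stands in for the real snapshot_download call.

```python
import fcntl
import hashlib
import os
from contextlib import contextmanager


@contextmanager
def repo_lock(repo_id, lock_dir):
    """Hold an exclusive advisory lock keyed on repo_id (POSIX only)."""
    os.makedirs(lock_dir, exist_ok=True)
    # Hash the repo id so "org/name" becomes a safe file name (assumed scheme).
    digest = hashlib.sha256(repo_id.encode("utf-8")).hexdigest()[:16]
    lock_path = os.path.join(lock_dir, f"prefetch-{digest}.lock")
    with open(lock_path, "w") as handle:
        # Blocks until every other holder on this node has released the lock.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield lock_path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def prefetch_once(repo_id, lock_dir, download):
    """Run download(repo_id) while holding the per-repo lock."""
    with repo_lock(repo_id, lock_dir):
        # The first entrant pays for the download; later entrants arrive
        # after it has finished and find a warm cache.
        download(repo_id)
```

Because the lock wraps the whole download rather than a single blob, a worker that enters later cannot start loading until the full shard set is in place.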

The helper is intentionally best-effort: prefetch failures (offline, gated repos, transient 5xx, missing flock support) are logged and swallowed so the subsequent from_pretrained call can surface the real, specific error to the user rather than being masked here.
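A best-effort wrapper in this spirit can be sketched as below. The name best_effort and the log wording are illustrative assumptions, not the module's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def best_effort(prefetch):
    """Run prefetch(); log and swallow any failure. Returns True on success."""
    try:
        prefetch()
    except Exception as exc:
        # Do not mask the real error: the later from_pretrained call
        # will raise something more specific if the problem persists.
        logger.warning("prefetch failed, deferring to from_pretrained: %s", exc)
        return False
    return True
```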

logger module-attribute

logger = getLogger(__name__)

from_pretrained_with_prefetch

from_pretrained_with_prefetch(
    factory: Callable[..., Any],
    model: str,
    *,
    subfolder: str,
    prefetch_list: Iterable[str],
    local_files_only: bool = False,
    max_attempts: int = _PREFETCH_MAX_ATTEMPTS,
    **from_pretrained_kwargs: Any,
) -> Any

Call factory (a bound from_pretrained), healing a racy or partial HF cache.

factory is a bound SomeModel.from_pretrained (or any callable with the same (model, *, subfolder, local_files_only, **kwargs) signature).

This is a stronger sibling of retry_on_missing_shard: that helper only retries the missing-shard OSError and never re-prefetches, so it cannot recover the second face of the same race. Two shapes of partial-cache failure crash the diffusion server outright:

  • OSError: <repo> does not appear to have a file named text_encoder/model-0000X-of-0000Y.safetensors - a shard is still under its .incomplete name.
  • RuntimeError: You set 'ignore_mismatched_sizes' to 'False' ... - text_encoder/config.json was not present yet, so transformers v5 silently fell back to the default (tiny) config and then could not load the real checkpoint into it.

Both heal once the cache is complete. So on those errors we re-run a verified prefetch (which blocks on the peer writer and retries the download) and reload, instead of letting the worker die. Local paths and local_files_only loads cannot be healed by re-fetching, so they raise on the first failure exactly as before.
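The heal-and-reload loop described above can be sketched as follows. This is an illustration under assumptions: load_with_heal, is_healable, and the reprefetch callback are hypothetical names, and matching on message substrings is a guess at how the two error shapes might be recognised.

```python
import os

# Substrings taken from the two error messages quoted above.
_HEALABLE_MARKERS = (
    "does not appear to have a file named",
    "ignore_mismatched_sizes",
)


def is_healable(exc):
    """True for the two partial-cache failure shapes."""
    return isinstance(exc, (OSError, RuntimeError)) and any(
        marker in str(exc) for marker in _HEALABLE_MARKERS
    )


def load_with_heal(factory, model, *, subfolder, reprefetch,
                   local_files_only=False, max_attempts=3, **kwargs):
    """Call factory; on a healable error, re-prefetch and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    # Local paths and local_files_only loads cannot be fixed by re-fetching.
    can_heal = not local_files_only and not os.path.isdir(model)
    for attempt in range(1, max_attempts + 1):
        try:
            return factory(model, subfolder=subfolder,
                           local_files_only=local_files_only, **kwargs)
        except (OSError, RuntimeError) as exc:
            if not can_heal or attempt == max_attempts or not is_healable(exc):
                raise
            # Blocks on the peer writer, then completes the cache.
            reprefetch(model, subfolder)
```

Errors that do not match either shape are re-raised immediately, so unrelated failures still surface on the first attempt.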

prefetch_subfolders

prefetch_subfolders(
    model: str,
    subfolders: Iterable[str],
    *,
    local_files_only: bool | None = None,
    include_root_metadata: bool = True,
) -> None

Materialise model's subfolders in the HF cache before loading.

Parameters:

  • model (str, required): A HuggingFace Hub repo id (e.g. "Qwen/Qwen-Image-Edit-2509") or a local directory path. Local paths are a no-op.
  • subfolders (Iterable[str], required): Iterable of subfolder names (e.g. ["text_encoder", "vae"]) whose contents need to be fully present before any worker calls from_pretrained(subfolder=...).
  • local_files_only (bool | None, default None): When True, skip the prefetch entirely. When None (default), auto-detect: skip if model is a local directory, run otherwise.
  • include_root_metadata (bool, default True): When True, also pull *.json at the repo root so model_index.json / config.json resolution during from_pretrained also hits a warm cache.
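One plausible way to turn these parameters into a download filter is sketched below. It assumes the helper passes glob patterns to snapshot_download through its allow_patterns argument. build_allow_patterns is a hypothetical name, and fnmatch is used only as a local stand-in for the Hub's pattern matching.

```python
from fnmatch import fnmatch


def build_allow_patterns(subfolders, include_root_metadata=True):
    """Map subfolder names to glob patterns such as 'text_encoder/*'."""
    patterns = [f"{name.strip('/')}/*" for name in subfolders]
    if include_root_metadata:
        # Covers model_index.json / config.json.
        patterns.append("*.json")
    return patterns


patterns = build_allow_patterns(["text_encoder", "vae"])
# These could then be handed to the Hub client, for example:
#   snapshot_download(repo_id=model, allow_patterns=patterns)

wanted = "text_encoder/model-00001-of-00002.safetensors"
unwanted = "transformer/diffusion_pytorch_model.safetensors"
selected = [any(fnmatch(path, p) for p in patterns) for path in (wanted, unwanted)]
```

Restricting the snapshot to the subfolders that are about to be loaded keeps the prefetch from pulling weights the worker never reads.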

retry_on_missing_shard

retry_on_missing_shard(
    load_fn,
    *,
    max_retries: int = 3,
    base_delay: float = 5.0,
)

Call load_fn with retry on the transformers v5 shard-resolution race.

When the prefetch lock cannot be acquired (e.g. flock unsupported on the filesystem and dotfile lock times out), from_pretrained may still hit the cached_files race. This wrapper retries with exponential backoff when the OSError message matches the specific "does not appear to have a file named" pattern.
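A retry loop of this shape might be written as follows. It is a sketch, not the module's implementation: retry_missing_shard is a hypothetical name, doubling the delay on each attempt is an assumption about what "exponential backoff" means here, and the sleep parameter exists only so the example can be exercised without waiting.

```python
import time

_MISSING_SHARD = "does not appear to have a file named"


def retry_missing_shard(load_fn, *, max_retries=3, base_delay=5.0,
                        sleep=time.sleep):
    """Call load_fn(), retrying only on the missing-shard OSError."""
    for attempt in range(max_retries + 1):
        try:
            return load_fn()
        except OSError as exc:
            # Unrelated OSErrors and the final failure propagate unchanged.
            if _MISSING_SHARD not in str(exc) or attempt == max_retries:
                raise
            # Assumed schedule: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

With the defaults this waits 5 s, 10 s, then 20 s before giving up, which gives a peer writer time to finish without retrying forever.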