vllm_omni.diffusion.model_loader.hub_prefetch
Best-effort HuggingFace Hub prefetch for multi-subfolder pipelines.
This module exists to defend diffusion pipelines against a race condition we hit after the transformers v5 rebase (see the Qwen-Image-Edit-2509 failure in Buildkite vllm-omni-rebase build 1043):
When several diffusion worker processes start in parallel and each calls SomeModel.from_pretrained(model_id, subfolder="text_encoder", ...) against a cold HuggingFace cache, transformers v5's cache resolution (cached_files) can observe a shard set that a peer worker has only partially written. It then raises OSError: <model_id> does not appear to have a file named text_encoder/model-00002-of-00002.safetensors, even though the peer will eventually finish writing that file.
Why origin/main does not need this helper
The exact same __init__ code lives on origin/main (e.g. the Qwen-Image pipeline_qwen_image_edit_plus.py in build vllm-omni#7412 passes without any prefetch), so the race is NOT a behavioural change in vLLM-Omni itself. Two environmental factors mask the race on main:
1. origin/main is pinned (transitively, via vLLM main) to transformers 4.x. In 4.x the per-file cached_file path resolves shards lazily, one at a time, so each hf_hub_download blocks on its own single-file .lock and the second worker naturally waits for the first worker's atomic rename. transformers>=5.0 rewrote this into cached_files (plural), which batch-resolves every shard listed in the index up-front via os.path.isfile and raises immediately if any shard is still sitting under its *.incomplete name. Same wave of v5 changes that introduced tie_weights(missing_keys=..., recompute_mapping=...) (see the Dynin shim in dynin_omni_token2text.py).
2. CI shares HF_HOME=/fsx/hf_cache across pipelines (both the vllm-omni and vllm-omni-rebase pipelines mount the same FS). That cache is normally warm for long-lived repos like Qwen-Image-Edit-2509, so most builds never go through the download path at all. Build 1043 happened to hit a partially-evicted cache AND transformers v5's stricter resolver simultaneously, which is why the failure looks 'rebase-specific' but is really a latent race that main was absorbing via (1).
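The eager check described in (1) can be illustrated with a small sketch. This is not the transformers implementation: `missing_shards` and `demo` are hypothetical names, and the flat directory layout is a simplification of the real HF cache (which uses blobs and symlinks). It only shows why a shard that exists solely under an `.incomplete` name fails an up-front `os.path.isfile` pass over the index.

```python
import json
import os
import tempfile


def missing_shards(model_dir: str, index_name: str = "model.safetensors.index.json") -> list[str]:
    """Return shard filenames listed in the index that are not on disk yet.

    Hypothetical helper: mimics an eager, all-shards-at-once resolution.
    """
    with open(os.path.join(model_dir, index_name), encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    # The index maps parameter names to shard filenames; dedupe and sort them.
    shards = sorted(set(weight_map.values()))
    return [s for s in shards if not os.path.isfile(os.path.join(model_dir, s))]


def demo() -> list[str]:
    """Build a fake half-downloaded subfolder and report the missing shards."""
    with tempfile.TemporaryDirectory() as d:
        index = {
            "weight_map": {
                "a.weight": "model-00001-of-00002.safetensors",
                "b.weight": "model-00002-of-00002.safetensors",
            }
        }
        with open(os.path.join(d, "model.safetensors.index.json"), "w", encoding="utf-8") as f:
            json.dump(index, f)
        # Shard 1 is complete; shard 2 is still being written by a peer worker.
        open(os.path.join(d, "model-00001-of-00002.safetensors"), "wb").close()
        open(os.path.join(d, "model-00002-of-00002.safetensors.incomplete"), "wb").close()
        return missing_shards(d)
```

A lazy, per-file resolver would instead block on the second shard's lock until the peer renamed it into place, which is why the same cache state is harmless under 4.x.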
huggingface_hub.snapshot_download does take per-blob .lock files, but those locks are only acquired once each blob is mid-write - they do not cover the surrounding cached_files shard-list resolution that transformers v5 performs eagerly. So we additionally wrap the snapshot_download call in our own node-wide fcntl.flock keyed on the repo id (see _repo_prefetch_lock). The first concurrent worker / process to reach the helper fully materialises the snapshot; subsequent entrants block on the flock and then find a warm cache, so their from_pretrained calls never observe a half-written shard set. For a warm cache the snapshot call is a near-noop (it only stat()s the files), so this is also cheap on origin/main should we ever backport it there.
The helper is intentionally best-effort: prefetch failures (offline, gated repos, transient 5xx, missing flock support) are logged and swallowed so the subsequent from_pretrained call can surface the real, specific error to the user rather than being masked here.
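The best-effort contract can be sketched as a wrapper that logs and swallows any prefetch failure. `best_effort` and the logger name are hypothetical and chosen only for illustration; the real helper may log at a different level or catch a narrower set of exceptions.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("hub_prefetch_sketch")  # hypothetical logger name


def best_effort(prefetch: Callable[[], None], what: str) -> bool:
    """Run prefetch(); on any failure, log it and report False instead of raising.

    The caller proceeds to from_pretrained either way, so the loader's own,
    more specific error is what the user eventually sees.
    """
    try:
        prefetch()
    except Exception as exc:  # deliberately broad: offline, gated repo, 5xx, no flock
        logger.warning("Prefetch of %s failed, continuing without it: %s", what, exc)
        return False
    return True
```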
from_pretrained_with_prefetch
from_pretrained_with_prefetch(
factory: Callable[..., Any],
model: str,
*,
subfolder: str,
prefetch_list: Iterable[str],
local_files_only: bool = False,
max_attempts: int = _PREFETCH_MAX_ATTEMPTS,
**from_pretrained_kwargs: Any,
) -> Any
Call factory.from_pretrained, healing a racy or partial HF cache.
factory is a bound SomeModel.from_pretrained (or any callable with the same (model, *, subfolder, local_files_only, **kwargs) signature).
This is a stronger sibling of retry_on_missing_shard: that helper only retries the missing-shard OSError and never re-prefetches, so it cannot recover the second face of the same race. Two shapes of partial-cache failure crash the diffusion server outright:
- OSError: <repo> does not appear to have a file named text_encoder/model-0000X-of-0000Y.safetensors - a shard is still under its .incomplete name.
- RuntimeError: You set 'ignore_mismatched_sizes' to 'False' ... - text_encoder/config.json was not present yet, so transformers v5 silently fell back to the default (tiny) config and then could not load the real checkpoint into it.
Both heal once the cache is complete. So on those errors we re-run a verified prefetch (which blocks on the peer writer and retries the download) and reload, instead of letting the worker die. Local paths and local_files_only loads cannot be healed by re-fetching, so they raise on the first failure exactly as before.
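The heal-and-reload loop described above might look roughly like the following. This is a sketch under stated assumptions, not the module's code: `load_with_heal`, `is_partial_cache_error`, the `healable` flag (standing in for "remote repo and not local_files_only"), and the default of three attempts are all hypothetical. The two matched substrings are taken from the error messages quoted above.

```python
from collections.abc import Callable
from typing import Any

MISSING_SHARD = "does not appear to have a file named"
SIZE_MISMATCH = "ignore_mismatched_sizes"


def is_partial_cache_error(exc: BaseException) -> bool:
    """True for the two failure shapes that a complete cache would cure."""
    msg = str(exc)
    if isinstance(exc, OSError):
        return MISSING_SHARD in msg
    if isinstance(exc, RuntimeError):
        return SIZE_MISMATCH in msg
    return False


def load_with_heal(
    load: Callable[[], Any],
    reprefetch: Callable[[], None],
    *,
    healable: bool = True,
    max_attempts: int = 3,
) -> Any:
    """Call load(); on a partial-cache error, re-prefetch and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return load()
        except (OSError, RuntimeError) as exc:
            out_of_attempts = attempt == max_attempts
            if not healable or out_of_attempts or not is_partial_cache_error(exc):
                raise  # surface the real error unchanged
            reprefetch()  # blocks on the peer writer, then retries the download
    raise AssertionError("unreachable")
```

Passing `healable=False` reproduces the behaviour for local paths and local_files_only loads: the first failure propagates and no re-fetch is attempted.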
prefetch_subfolders
prefetch_subfolders(
model: str,
subfolders: Iterable[str],
*,
local_files_only: bool | None = None,
include_root_metadata: bool = True,
) -> None
Materialise model's subfolders in the HF cache before loading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str | A HuggingFace Hub repo id (e.g. | required |
| subfolders | Iterable[str] | Iterable of subfolder names (e.g. | required |
| local_files_only | bool \| None | When | None |
| include_root_metadata | bool | When True, also pull | True |
retry_on_missing_shard
Call load_fn with retry on the transformers v5 shard-resolution race.
When the prefetch lock cannot be acquired (e.g. flock unsupported on the filesystem and dotfile lock times out), from_pretrained may still hit the cached_files race. This wrapper retries with exponential backoff when the OSError message matches the specific "does not appear to have a file named" pattern.
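A minimal sketch of such a retry wrapper follows. The function name, the defaults (`max_attempts=4`, `base_delay=0.5`), and the injectable `sleep` parameter are assumptions made for illustration; the actual signature of retry_on_missing_shard is not shown in this page.

```python
import time
from collections.abc import Callable
from typing import Any

MISSING_SHARD_PATTERN = "does not appear to have a file named"


def retry_missing_shard_sketch(
    load_fn: Callable[[], Any],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry load_fn with exponential backoff, only for the missing-shard OSError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return load_fn()
        except OSError as exc:
            is_last = attempt == max_attempts - 1
            if is_last or MISSING_SHARD_PATTERN not in str(exc):
                raise  # unrelated OSError, or retries exhausted
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Because it never re-runs the prefetch, this wrapper only helps when a peer is still writing and will finish on its own; any other OSError is re-raised immediately.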