vllm_omni.diffusion.cache.prompt_embed_cache ¶
Prompt-embedding cache for diffusion pipelines.
Text encoders are the single most expensive per-request preprocessing step in most diffusion pipelines, and their inputs are frequently identical across requests (e.g. in GRPO-style rollouts the same prompt is submitted many times with different seeds to sample different images). Re-running the text encoder for each of those requests wastes compute and GPU memory.
This module provides a small LRU cache plus a transparent wrapper for a pipeline's encode_prompt method. Because almost every diffusion pipeline in vllm_omni/diffusion/models routes all text-encoder invocations through self.encode_prompt(...), wrapping that single method is sufficient to cache results model-wide without per-pipeline edits.
Design points
- The wrapper is installed by :class:
DiffusionModelRunnerafter the pipeline has loaded, so each runner process owns its own cache. - Cache keys are derived from the bound
encode_promptarguments. Only inputs we can safely hash (str/int/float/bool/None/bytes/torch.device/torch.dtype/ numpy scalars / nested lists/tuples/dicts of those) participate in the key. If any argument is a tensor, PIL image, or other non-trivial object, we bypass the cache for that call to guarantee correctness. - If a caller passes precomputed
*_embedsintoencode_promptthe wrapper also bypasses the cache because the call is already short-circuit. - Cache values are detached tensors. Downstream pipeline code typically does non-inplace ops (
.repeat,.view, slicing); we do not clone on hit since outputs are treated as read-only.
PromptEmbedCache ¶
Thread-safe LRU cache for encode_prompt outputs.
The cache stores whatever encode_prompt returns (tensor, tuple of tensors, None, etc.). Lookup / insertion is O(1) amortized. Eviction is least-recently-used.
get ¶
Return the cached value or _CACHE_MISS if absent.
Returning a sentinel (rather than None) lets callers cache legitimate None results from the wrapped function.
install_prompt_embed_cache ¶
install_prompt_embed_cache(
pipeline: Any,
*,
max_size: int = 32,
enabled: bool = True,
model_tag: str | None = None,
) -> PromptEmbedCache | None
Wrap pipeline.encode_prompt so results are cached by argument identity.
Idempotent: calling twice on the same pipeline is a no-op and returns the existing cache. Returns None if the pipeline has no encode_prompt method.
resolve_prompt_embed_cache_config ¶
resolve_prompt_embed_cache_config(
enable: bool | None = None, max_size: int | None = None
) -> tuple[bool, int]
Combine explicit args with env-var overrides.
Environment variables (useful for quick enablement in GRPO jobs without touching config files):
``OMNI_DIFFUSION_PROMPT_EMBED_CACHE`` (``1``/``0``/``true``/``false``)
``OMNI_DIFFUSION_PROMPT_EMBED_CACHE_SIZE`` (positive int)