TeaCache Guide¶
Table of Content¶
Overview¶
TeaCache accelerates diffusion model inference by caching transformer computations when consecutive timesteps are similar, providing 1.5x-2.0x speedup with minimal quality loss. It dynamically decides whether to reuse cached outputs based on input similarity, making it ideal for production deployments where inference speed matters without sacrificing generation quality.
See supported models list in Supported Models.
Quick Start¶
Basic Usage¶
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
)
outputs = omni.generate(
"A cat sitting on a windowsill",
OmniDiffusionSamplingParams(num_inference_steps=50),
)
Custom Configuration¶
omni = Omni(
model="Qwen/Qwen-Image",
cache_backend="tea_cache",
cache_config={
"rel_l1_thresh": 0.2, # Controls speed/quality tradeoff
},
)
Using Environment Variable¶
You can also enable TeaCache via environment variable:
Then initialize without explicitly setting cache_backend:
from vllm_omni import Omni
omni = Omni(
model="Qwen/Qwen-Image",
cache_config={"rel_l1_thresh": 0.2}
)
Example Script¶
Offline Inference¶
Use python script under examples/offline_inference/text_to_image/ or examples/offline_inference/image_to_image/ with CLI:
# Text-to-image example
python examples/offline_inference/text_to_image/text_to_image.py \
--model Qwen/Qwen-Image \
--cache-backend tea_cache
# Image-to-image example
python examples/offline_inference/image_to_image/image_edit.py \
--model Qwen/Qwen-Image-Edit \
--image input.png \
--prompt "Edit description" \
--cache-backend tea_cache \
--tea-cache-rel-l1-thresh 0.25
See the text_to_image.py or image_edit.py for detailed configuration options.
Online Serving¶
# Default configuration
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend tea_cache
# Custom configuration
vllm serve Qwen/Qwen-Image --omni --port 8091 \
--cache-backend tea_cache \
--cache-config '{"rel_l1_thresh": 0.2}'
Configuration Parameters¶
In OmniDiffusionConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
rel_l1_thresh | float | 0.2 | Similarity threshold for cache reuse. Lower values prioritize quality (less caching), higher values prioritize speed (more caching). Suggested range: 0.1-0.8 |
coefficients | list[float] | None | None | Polynomial coefficients for rescaling L1 distance. Must contain exactly 5 elements if provided. If None, uses model-specific defaults based on transformer type. |
Users can find the default model coefficients in vllm_omni/diffusion/cache/teacache/config.py, for example:
_MODEL_COEFFICIENTS = {
# Qwen-Image transformer coefficients from ComfyUI-TeaCache
# Tuned specifically for Qwen's dual-stream transformer architecture
# Used for all Qwen-Image Family pipelines, in general
"QwenImageTransformer2DModel": [
-4.50000000e02,
2.80000000e02,
-4.50000000e01,
3.20000000e00,
-2.00000000e-02,
],
...
}
Best Practices¶
When to Use¶
Good for:
- Production deployments requiring faster inference, tolerant of minimal quality loss
- Scenarios where 1.5-2x speedup is valuable
- Useful for single-card acceleration
Not for:
- Maximum quality requirements where no degradation is acceptable
- Very short inference runs (< 20 steps) where caching overhead may outweigh benefits
Troubleshooting¶
Common Issue 1: Quality Degradation¶
Symptoms: Generated images show artifacts, reduced detail, or inconsistent quality compared to non-cached results
Solution:
Common Issue 2: Limited Speedup¶
Symptoms: Actual speedup is less than expected (< 1.3x)
Solutions: 1. Increase the threshold to enable more aggressive caching:
2. Ensure you're using sufficient inference steps (35+ recommended) 3. Check that your model architecture is supported (see Supported Models section)Summary¶
- ✅ Enable TeaCache - Set
cache_backend="tea_cache"to get 1.5x-2.0x speedup with optimized defaults - ✅ (Optional) Customize - Adjust thresholds and polynomial coefficients for specific speed/quality trade-offs