TeaCache Guide¶

Table of Content¶

Overview
Quick Start
Example Script
Configuration Parameters
Best Practices
Troubleshooting
Summary

Overview¶

TeaCache accelerates diffusion model inference by caching transformer computations when consecutive timesteps are similar, providing 1.5x-2.0x speedup with minimal quality loss. It dynamically decides whether to reuse cached outputs based on input similarity, making it ideal for production deployments where inference speed matters without sacrificing generation quality.

See supported models list in Supported Models.

Quick Start¶

Basic Usage¶

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)

Custom Configuration¶

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={
        "rel_l1_thresh": 0.2,  # Controls speed/quality tradeoff
    },
)

Using Environment Variable¶

You can also enable TeaCache via environment variable:

export DIFFUSION_CACHE_BACKEND=tea_cache

Then initialize without explicitly setting cache_backend:

from vllm_omni import Omni

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_config={"rel_l1_thresh": 0.2}
)

Example Script¶

Offline Inference¶

Use python script under examples/offline_inference/text_to_image/ or examples/offline_inference/image_to_image/ with CLI:

# Text-to-image example
python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image \
  --cache-backend tea_cache

# Image-to-image example
python examples/offline_inference/image_to_image/image_edit.py \
  --model Qwen/Qwen-Image-Edit \
  --image input.png \
  --prompt "Edit description" \
  --cache-backend tea_cache \
  --tea-cache-rel-l1-thresh 0.25

See the text_to_image.py or image_edit.py for detailed configuration options.

Online Serving¶

# Default configuration
vllm serve Qwen/Qwen-Image --omni --port 8091 --cache-backend tea_cache

# Custom configuration
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend tea_cache \
  --cache-config '{"rel_l1_thresh": 0.2}'

Configuration Parameters¶

In OmniDiffusionConfig

Parameter	Type	Default	Description
`rel_l1_thresh`	float	`0.2`	Similarity threshold for cache reuse. Lower values prioritize quality (less caching), higher values prioritize speed (more caching). Suggested range: 0.1-0.8
`coefficients`	list[float] \| None	`None`	Polynomial coefficients for rescaling L1 distance. Must contain exactly 5 elements if provided. If `None`, uses model-specific defaults based on transformer type.

Users can find the default model coefficients in vllm_omni/diffusion/cache/teacache/config.py, for example:

_MODEL_COEFFICIENTS = {
    # Qwen-Image transformer coefficients from ComfyUI-TeaCache
    # Tuned specifically for Qwen's dual-stream transformer architecture
    # Used for all Qwen-Image Family pipelines, in general
    "QwenImageTransformer2DModel": [
        -4.50000000e02,
        2.80000000e02,
        -4.50000000e01,
        3.20000000e00,
        -2.00000000e-02,
    ],
    ...
}

Best Practices¶

When to Use¶

Good for:

Production deployments requiring faster inference, tolerant of minimal quality loss
Scenarios where 1.5-2x speedup is valuable
Useful for single-card acceleration

Not for:

Maximum quality requirements where no degradation is acceptable
Very short inference runs (< 20 steps) where caching overhead may outweigh benefits

Troubleshooting¶

Common Issue 1: Quality Degradation¶

Symptoms: Generated images show artifacts, reduced detail, or inconsistent quality compared to non-cached results

Solution:

# Lower the threshold for more conservative caching
cache_config={"rel_l1_thresh": 0.1}

Common Issue 2: Limited Speedup¶

Symptoms: Actual speedup is less than expected (< 1.3x)

Solutions: 1. Increase the threshold to enable more aggressive caching:

cache_config={"rel_l1_thresh": 0.8}

2. Ensure you're using sufficient inference steps (35+ recommended) 3. Check that your model architecture is supported (see Supported Models section)

Summary¶

✅ Enable TeaCache - Set cache_backend="tea_cache" to get 1.5x-2.0x speedup with optimized defaults
✅ (Optional) Customize - Adjust thresholds and polynomial coefficients for specific speed/quality trade-offs