vllm_omni.diffusion.models.dreamzero.image_encoder

DreamZero image encoder.

Only the visual tower that DreamZero I2V inference uses is ported here. Checkpoint keys under action_head.image_encoder.* are loaded by stripping that prefix.
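The prefix stripping described above can be sketched in a few lines. This is a minimal illustration, not the loader used by vllm_omni: the helper name strip_prefix and the example key names are hypothetical, and plain strings stand in for tensors.

```python
# Hypothetical helper; the actual loading code in vllm_omni may differ.
PREFIX = "action_head.image_encoder."


def strip_prefix(state_dict, prefix=PREFIX):
    """Keep only keys that start with `prefix` and drop the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy checkpoint: key names are illustrative, values stand in for tensors.
checkpoint = {
    "action_head.image_encoder.model.visual.head": "tensor_a",
    "action_head.other_module.weight": "tensor_b",
}
encoder_state = strip_prefix(checkpoint)
```

With real tensors, a dictionary filtered this way could be passed to the encoder's load_state_dict, since the remaining keys are relative to the encoder module.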

DreamZeroImageEncoder

Bases: Module

Image encoder wrapper. It pairs the CLIP container (model) with the input normalization (transforms).

model instance-attribute

model = _DreamZeroCLIPContainer()

transforms instance-attribute

transforms = Compose(
    [
        Normalize(
            mean=[0.48145466, 0.4578275, 0.40821073],
            std=[0.26862954, 0.26130258, 0.27577711],
        )
    ]
)
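The mean and std above match the normalization constants commonly used with OpenAI CLIP. Normalize computes (value - mean) / std per channel. The sketch below reproduces that arithmetic in plain Python for a single RGB pixel; it assumes the input is already scaled to [0, 1], which the source does not state, and normalize_pixel is an illustrative name.

```python
MEAN = [0.48145466, 0.4578275, 0.40821073]
STD = [0.26862954, 0.26130258, 0.27577711]


def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


# A mid-gray pixel, assuming values in [0, 1].
mid_gray = normalize_pixel([0.5, 0.5, 0.5])

# A pixel equal to the per-channel mean maps to zero in every channel.
at_mean = normalize_pixel(MEAN)
```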

encode_image

encode_image(videos: Tensor) -> Tensor

Encode images for I2V conditioning.

DreamZeroLayerNorm

Bases: LayerNorm

LayerNorm that preserves the input dtype.

forward

forward(x: Tensor) -> Tensor
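One common way to build a LayerNorm that preserves the input dtype is to normalize in float32 and cast the result back. The sketch below shows that pattern; it is an assumption about the approach, not the DreamZeroLayerNorm source, and the class name DtypePreservingLayerNorm is illustrative.

```python
import torch
import torch.nn as nn


class DtypePreservingLayerNorm(nn.LayerNorm):
    """Illustrative stand-in: normalize in float32, return the input dtype."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast for numerical stability, then cast back to x's dtype.
        return super().forward(x.float()).type_as(x)


norm = DtypePreservingLayerNorm(8, eps=1e-5)
x = torch.randn(2, 4, 8, dtype=torch.bfloat16)
y = norm(x)  # same shape and dtype as x
```

The upcast keeps the mean and variance computation in full precision while callers continue to see half-precision activations.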

DreamZeroVisionAttentionBlock

Bases: Module

Attention block for the vision tower.

attn instance-attribute

attn = DreamZeroVisionSelfAttention(
    dim, num_heads, proj_dropout=proj_dropout
)

mlp instance-attribute

mlp = Sequential(
    Linear(dim, hidden_dim),
    GELU(),
    Linear(hidden_dim, dim),
    Dropout(proj_dropout),
)

norm1 instance-attribute

norm1 = DreamZeroLayerNorm(dim, eps=norm_eps)

norm2 instance-attribute

norm2 = DreamZeroLayerNorm(dim, eps=norm_eps)

post_norm instance-attribute

post_norm = post_norm

forward

forward(x: Tensor) -> Tensor
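The source does not show how post_norm changes the forward pass. A plausible reading, sketched below as an assumption, is that it switches between normalizing the input of each sub-layer (pre-norm) and normalizing its output before the residual add. The class ResidualBlockSketch is hypothetical, and a Linear layer stands in for self-attention.

```python
import torch
import torch.nn as nn


class ResidualBlockSketch(nn.Module):
    """Hypothetical block showing one pre-norm / post-norm convention."""

    def __init__(self, dim: int, hidden_dim: int, post_norm: bool = False):
        super().__init__()
        self.post_norm = post_norm
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)  # placeholder for self-attention
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.post_norm:
            # Assumed post-norm variant: normalize each sub-layer's output.
            x = x + self.norm1(self.attn(x))
            x = x + self.norm2(self.mlp(x))
        else:
            # Pre-norm variant: normalize each sub-layer's input.
            x = x + self.attn(self.norm1(x))
            x = x + self.mlp(self.norm2(x))
        return x


x = torch.randn(2, 5, 16)
pre = ResidualBlockSketch(16, 64, post_norm=False)(x)
post = ResidualBlockSketch(16, 64, post_norm=True)(x)
```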

DreamZeroVisionSelfAttention

Bases: Module

Self-attention for the vision tower.

dim instance-attribute

dim = dim

head_dim instance-attribute

head_dim = dim // num_heads

num_heads instance-attribute

num_heads = num_heads

proj instance-attribute

proj = Linear(dim, dim)

proj_dropout instance-attribute

proj_dropout = proj_dropout

to_qkv instance-attribute

to_qkv = Linear(dim, dim * 3)

forward

forward(x: Tensor) -> Tensor
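The attributes above (to_qkv, head_dim, proj) suggest a standard fused-QKV multi-head attention. A minimal sketch under that assumption follows; SelfAttentionSketch is an illustrative name, dropout is omitted, and the real forward may use a different attention kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionSketch(nn.Module):
    """Illustrative fused-QKV self-attention; not the DreamZero source."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, c = x.shape
        # One projection yields q, k and v: [b, s, 3, heads, head_dim].
        qkv = self.to_qkv(x).view(b, s, 3, self.num_heads, self.head_dim)
        # Reorder to [3, b, heads, s, head_dim] and split along dim 0.
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        out = F.scaled_dot_product_attention(q, k, v)
        # Merge heads back into the channel dimension.
        out = out.transpose(1, 2).reshape(b, s, c)
        return self.proj(out)


x = torch.randn(2, 5, 16)
y = SelfAttentionSketch(dim=16, num_heads=4)(x)
```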

DreamZeroVisionTransformer

Bases: Module

Vision transformer used by the image encoder.

cls_embedding instance-attribute

cls_embedding = Parameter(gain * randn(1, 1, dim))

dim instance-attribute

dim = dim

dropout instance-attribute

dropout = Dropout(embedding_dropout)

head instance-attribute

head = Parameter(gain * randn(dim, out_dim))

image_size instance-attribute

image_size = image_size

num_heads instance-attribute

num_heads = num_heads

num_layers instance-attribute

num_layers = num_layers

num_patches instance-attribute

num_patches = (image_size // patch_size) ** 2
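As a worked example of the formula above, the sketch below uses image_size=224 and patch_size=14. These values are illustrative and are not confirmed as the DreamZero configuration.

```python
def count_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


patches = count_patches(224, 14)  # (224 // 14) ** 2 = 16 ** 2
seq_len = patches + 1             # pos_embedding has num_patches + 1 rows
```

Because of the floor division, an image size that is not a multiple of the patch size yields the same count as the next lower multiple.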

patch_embedding instance-attribute

patch_embedding = Conv2d(
    3,
    dim,
    kernel_size=patch_size,
    stride=patch_size,
    bias=not pre_norm,
)

patch_size instance-attribute

patch_size = patch_size

pool_type instance-attribute

pool_type = pool_type

pos_embedding instance-attribute

pos_embedding = Parameter(
    gain * randn(1, num_patches + 1, dim)
)
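The shapes of patch_embedding, cls_embedding and pos_embedding fit the usual ViT token assembly: embed patches, prepend a class token, then add position embeddings. The sketch below assumes that order, uses small illustrative sizes, and omits the gain scaling; it is not the DreamZeroVisionTransformer forward.

```python
import torch
import torch.nn as nn

dim, image_size, patch_size = 32, 28, 14  # small illustrative values
num_patches = (image_size // patch_size) ** 2  # 2 * 2 = 4

patch_embedding = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
cls_embedding = nn.Parameter(torch.randn(1, 1, dim))
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

images = torch.randn(2, 3, image_size, image_size)
# [2, dim, 2, 2] -> [2, dim, 4] -> [2, 4, dim]
tokens = patch_embedding(images).flatten(2).transpose(1, 2)
# Prepend the class token to every sample: [2, 5, dim]
tokens = torch.cat([cls_embedding.expand(2, -1, -1), tokens], dim=1)
# Broadcast-add the learned position embeddings.
tokens = tokens + pos_embedding
```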

post_norm instance-attribute

post_norm = DreamZeroLayerNorm(dim, eps=norm_eps)

pre_norm instance-attribute

pre_norm = (
    DreamZeroLayerNorm(dim, eps=norm_eps)
    if pre_norm
    else None
)

transformer instance-attribute

transformer = Sequential(
    *[
        DreamZeroVisionAttentionBlock(
            dim=dim,
            mlp_ratio=mlp_ratio,
            num_heads=num_heads,
            post_norm=post_norm,
            activation=activation,
            proj_dropout=proj_dropout,
            norm_eps=norm_eps,
        )
        for _ in range(num_layers)
    ]
)

forward

forward(x: Tensor, use_31_block: bool = False) -> Tensor