
vllm_omni.diffusion.models.dreamzero.action_encoder

Action encoder/decoder for DreamZero.

CategorySpecificLinear

Bases: Module

Per-category linear: x @ W[cat_id] + b[cat_id]

Attributes:

| Name | Type | Description |
|---|---|---|
| W | Parameter | Shape (num_categories, input_dim, hidden_dim); initialized as 0.02 * randn |
| b | Parameter | Shape (num_categories, hidden_dim); zero-initialized |

W instance-attribute

W = Parameter(
    0.02 * randn(num_categories, input_dim, hidden_dim)
)

b instance-attribute

b = Parameter(zeros(num_categories, hidden_dim))

forward

forward(x: Tensor, cat_ids: Tensor) -> Tensor
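A minimal PyTorch sketch of how a per-category linear layer like this can be computed, assuming x has shape (B, T, input_dim) and cat_ids has shape (B,), the shapes that MultiEmbodimentActionEncoder.forward documents. Gathering W and b by cat_ids and multiplying with torch.bmm is an assumed implementation, not a copy of the vllm_omni code.

```python
import torch
from torch import nn


class CategorySpecificLinear(nn.Module):
    """Sketch: one (input_dim -> hidden_dim) linear map per category."""

    def __init__(self, num_categories: int, input_dim: int, hidden_dim: int):
        super().__init__()
        # Same shapes and initialization as the documented attributes.
        self.W = nn.Parameter(0.02 * torch.randn(num_categories, input_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(num_categories, hidden_dim))

    def forward(self, x: torch.Tensor, cat_ids: torch.Tensor) -> torch.Tensor:
        # Assumed shapes: x is (B, T, input_dim), cat_ids is (B,) of integer ids.
        selected_W = self.W[cat_ids]  # (B, input_dim, hidden_dim)
        selected_b = self.b[cat_ids]  # (B, hidden_dim)
        # Batched matmul applies each sample's own weight matrix to all T tokens.
        return torch.bmm(x, selected_W) + selected_b.unsqueeze(1)


layer = CategorySpecificLinear(num_categories=4, input_dim=8, hidden_dim=16)
x = torch.randn(2, 5, 8)
cat_ids = torch.tensor([0, 3])  # sample 0 uses category 0, sample 1 uses category 3
out = layer(x, cat_ids)
print(out.shape)  # torch.Size([2, 5, 16])
```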

CategorySpecificMLP

Bases: Module

Two-layer MLP: layer1 → ReLU → layer2 (both layers are category-specific)

layer1 instance-attribute

layer1 = CategorySpecificLinear(
    num_categories, input_dim, hidden_dim
)

layer2 instance-attribute

layer2 = CategorySpecificLinear(
    num_categories, hidden_dim, output_dim
)

forward

forward(x: Tensor, cat_ids: Tensor) -> Tensor
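A sketch of the two-layer forward pass, assuming the constructor takes (num_categories, input_dim, hidden_dim, output_dim) as the layer1 and layer2 attributes suggest, with ReLU applied between the layers. The per-category linear is redefined compactly (gather by cat_ids, then torch.bmm) so the block runs on its own; that helper is an assumed implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class CategorySpecificLinear(nn.Module):
    # Assumed compact per-category linear: x @ W[cat_id] + b[cat_id].
    def __init__(self, num_categories, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Parameter(0.02 * torch.randn(num_categories, input_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(num_categories, hidden_dim))

    def forward(self, x, cat_ids):
        return torch.bmm(x, self.W[cat_ids]) + self.b[cat_ids].unsqueeze(1)


class CategorySpecificMLP(nn.Module):
    # Assumed constructor signature, inferred from the layer1/layer2 attributes.
    def __init__(self, num_categories, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layer1 = CategorySpecificLinear(num_categories, input_dim, hidden_dim)
        self.layer2 = CategorySpecificLinear(num_categories, hidden_dim, output_dim)

    def forward(self, x, cat_ids):
        hidden = F.relu(self.layer1(x, cat_ids))  # (B, T, hidden_dim)
        return self.layer2(hidden, cat_ids)       # (B, T, output_dim)


mlp = CategorySpecificMLP(num_categories=3, input_dim=6, hidden_dim=12, output_dim=4)
x = torch.randn(2, 7, 6)
cat_ids = torch.tensor([2, 0])
out = mlp(x, cat_ids)
print(out.shape)  # torch.Size([2, 7, 4])
```

Both layers select weights with the same cat_ids, so every sample passes through a single category's pair of weight matrices.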

MultiEmbodimentActionEncoder

Bases: Module

Encodes actions using embodiment-specific weights plus a sinusoidal timestep encoding.

Flow: actions → W1 (a_emb) → concat(a_emb, pos_enc(timesteps)) → W2 → swish → W3

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| action_dim | int | Action vector dimension (e.g. 32) | required |
| hidden_size | int | Output/hidden dimension (e.g. 5120, the model dimension) | required |
| num_embodiments | int | Number of robot types (e.g. 32) | required |

W1 instance-attribute

W1 = CategorySpecificLinear(
    num_embodiments, action_dim, hidden_size
)

W2 instance-attribute

W2 = CategorySpecificLinear(
    num_embodiments, 2 * hidden_size, hidden_size
)

W3 instance-attribute

W3 = CategorySpecificLinear(
    num_embodiments, hidden_size, hidden_size
)

hidden_size instance-attribute

hidden_size = hidden_size

pos_encoding instance-attribute

pos_encoding = SinusoidalPositionalEncoding(hidden_size)

forward

forward(
    actions: Tensor, timesteps: Tensor, cat_ids: Tensor
) -> Tensor

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| actions | Tensor | Shape (B, T, action_dim) | required |
| timesteps | Tensor | Shape (B, T); one timestep per token | required |
| cat_ids | Tensor | Shape (B,); one embodiment id per sample | required |

Returns: (B, T, hidden_size)
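The documented flow can be sketched end to end in PyTorch. Assumptions, labeled here and in the comments: the per-category linear uses gather plus torch.bmm, the sinusoidal encoding uses the common half-sine/half-cosine layout with base 10000, and hidden_size is even so the encoding has exactly hidden_size channels to match W2's 2 * hidden_size input. The small sizes and the timestep range in the usage lines are illustrative only.

```python
import math

import torch
from torch import nn


def swish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.sigmoid(x)


class CategorySpecificLinear(nn.Module):
    # Assumed implementation: x @ W[cat_id] + b[cat_id] via gather + bmm.
    def __init__(self, num_categories, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Parameter(0.02 * torch.randn(num_categories, input_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(num_categories, hidden_dim))

    def forward(self, x, cat_ids):
        return torch.bmm(x, self.W[cat_ids]) + self.b[cat_ids].unsqueeze(1)


class SinusoidalPositionalEncoding(nn.Module):
    # Assumed layout: sin on the first half of channels, cos on the second, base 10000.
    def __init__(self, embedding_dim):
        super().__init__()
        self.embedding_dim = embedding_dim

    def forward(self, timesteps):
        half = self.embedding_dim // 2
        steps = torch.arange(half, dtype=torch.float32, device=timesteps.device) / half
        freqs = torch.exp(-math.log(10000.0) * steps)           # (half,)
        angles = timesteps.float().unsqueeze(-1) * freqs         # (B, T, half)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, T, 2 * half)


class MultiEmbodimentActionEncoder(nn.Module):
    def __init__(self, action_dim, hidden_size, num_embodiments):
        super().__init__()
        self.hidden_size = hidden_size  # assumed even (see lead-in)
        self.W1 = CategorySpecificLinear(num_embodiments, action_dim, hidden_size)
        self.W2 = CategorySpecificLinear(num_embodiments, 2 * hidden_size, hidden_size)
        self.W3 = CategorySpecificLinear(num_embodiments, hidden_size, hidden_size)
        self.pos_encoding = SinusoidalPositionalEncoding(hidden_size)

    def forward(self, actions, timesteps, cat_ids):
        a_emb = self.W1(actions, cat_ids)                        # (B, T, hidden_size)
        tau_emb = self.pos_encoding(timesteps).to(a_emb.dtype)   # (B, T, hidden_size)
        x = torch.cat([a_emb, tau_emb], dim=-1)                  # (B, T, 2 * hidden_size)
        x = swish(self.W2(x, cat_ids))                           # (B, T, hidden_size)
        return self.W3(x, cat_ids)                               # (B, T, hidden_size)


encoder = MultiEmbodimentActionEncoder(action_dim=32, hidden_size=64, num_embodiments=32)
actions = torch.randn(2, 16, 32)
timesteps = torch.randint(0, 1000, (2, 16))  # illustrative range only
cat_ids = torch.tensor([0, 5])
out = encoder(actions, timesteps, cat_ids)
print(out.shape)  # torch.Size([2, 16, 64])
```

The concatenation explains why W2 maps 2 * hidden_size down to hidden_size: half of its input is the action embedding and half is the timestep encoding.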

SinusoidalPositionalEncoding

Bases: Module

Sinusoidal encoding: (B, T) timesteps → (B, T, embedding_dim)

embedding_dim instance-attribute

embedding_dim = embedding_dim

forward

forward(timesteps: Tensor) -> Tensor
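The exact frequency schedule is not documented on this page. The sketch below assumes the common Transformer-style layout (sine on the first half of the channels, cosine on the second, geometric frequencies with base 10000) and shows the (B, T) → (B, T, embedding_dim) shape contract for an even embedding_dim.

```python
import math

import torch
from torch import nn


class SinusoidalPositionalEncoding(nn.Module):
    """Sketch: maps (B, T) timesteps to (B, T, embedding_dim) features."""

    def __init__(self, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
        half = self.embedding_dim // 2
        # Assumed geometric frequency ladder: exp(-ln(10000) * i / half), i = 0..half-1.
        exponents = torch.arange(half, dtype=torch.float32, device=timesteps.device) / half
        freqs = torch.exp(-math.log(10000.0) * exponents)  # (half,)
        angles = timesteps.float().unsqueeze(-1) * freqs   # (B, T, half)
        # First half of the channels is sine, second half is cosine.
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


enc = SinusoidalPositionalEncoding(embedding_dim=8)
timesteps = torch.tensor([[0, 1, 2], [10, 20, 30]])
out = enc(timesteps)
print(out.shape)  # torch.Size([2, 3, 8])
```

At timestep 0 every sine channel is 0 and every cosine channel is 1, a quick sanity check for any encoding of this form.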

swish

swish(x: Tensor) -> Tensor

Swish activation: x * sigmoid(x).
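Swish with a unit slope is the same function PyTorch exposes as torch.nn.functional.silu, so the two can be checked against each other. Whether vllm_omni's swish delegates to F.silu is not stated here; the sketch writes the formula out directly.

```python
import torch
import torch.nn.functional as F


def swish(x: torch.Tensor) -> torch.Tensor:
    # Swish with beta = 1: x * sigmoid(x), mathematically identical to SiLU.
    return x * torch.sigmoid(x)


x = torch.linspace(-4.0, 4.0, steps=9)
print(torch.allclose(swish(x), F.silu(x)))  # True
print(swish(torch.tensor(0.0)).item())      # 0.0
```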