vllm_omni.diffusion.lora.layers.base_linear

DiffusionBaseLinearLayerWithLoRA

Bases: BaseLinearLayerWithLoRA

Diffusion-specific base that overrides apply() to use direct torch matmul instead of punica_wrapper.

punica_wrapper exists to handle multiple LoRA slots and slices efficiently; this class targets the single-LoRA case, where a plain matmul performs the same computation.

This matches the semantics of PunicaWrapperGPU.add_lora_linear():

- Shrink: buffer = x @ lora_a.T
- Expand: y += buffer @ lora_b.T
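The shrink and expand steps above can be sketched in plain PyTorch. This is a minimal illustration, not the class's actual code: the function name `lora_linear_reference` and the tensor shapes (`lora_a` as `(rank, in_features)`, `lora_b` as `(out_features, rank)`) are assumptions chosen to make the two formulas type-check.

```python
import torch


def lora_linear_reference(
    x: torch.Tensor,         # (tokens, in_features)
    base_out: torch.Tensor,  # (tokens, out_features), output of the base layer
    lora_a: torch.Tensor,    # (rank, in_features)   -- assumed layout
    lora_b: torch.Tensor,    # (out_features, rank)  -- assumed layout
) -> torch.Tensor:
    # Hypothetical helper, named for illustration only.
    buffer = x @ lora_a.T               # Shrink: project down to the LoRA rank
    return base_out + buffer @ lora_b.T  # Expand: project back up and accumulate


torch.manual_seed(0)
tokens, in_features, out_features, rank = 4, 16, 32, 8
x = torch.randn(tokens, in_features, dtype=torch.float64)
weight = torch.randn(out_features, in_features, dtype=torch.float64)
lora_a = torch.randn(rank, in_features, dtype=torch.float64)
lora_b = torch.randn(out_features, rank, dtype=torch.float64)

base_out = x @ weight.T
y = lora_linear_reference(x, base_out, lora_a, lora_b)

# Equivalent view: merging the low-rank update into the base weight first.
merged = x @ (weight + lora_b @ lora_a).T
```

Under these assumed shapes, `y` and `merged` agree up to floating-point error, which is a convenient sanity check that shrink-then-expand equals adding the low-rank update `lora_b @ lora_a` to the base weight.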

All other functionality (weight management, TP slicing, forward logic) is inherited from vLLM's BaseLinearLayerWithLoRA.

apply

apply(x: Tensor, bias: Tensor | None = None) -> Tensor

Override: use a simple matmul instead of punica_wrapper.add_lora_linear().

This matches the exact computation in PunicaWrapperGPU.add_lora_linear() for the single-LoRA case. For packed projections (e.g. fused QKV), we apply LoRA per-slice using output_slices.

create_lora_weights

create_lora_weights(
    max_loras: int, lora_config, model_config=None
) -> None

reset_lora

reset_lora(index: int)

set_lora

set_lora(
    index: int,
    lora_a: Tensor | list[Tensor | None],
    lora_b: Tensor | list[Tensor | None],
)