vllm_omni.diffusion.lora.layers.base_linear
DiffusionBaseLinearLayerWithLoRA
Bases: BaseLinearLayerWithLoRA
Diffusion-specific base class that overrides apply() to use a direct torch matmul instead of punica_wrapper.
punica_wrapper exists to manage multiple LoRA slots and slices efficiently; the direct matmul here covers the single-LoRA case.
This matches the semantics of PunicaWrapperGPU.add_lora_linear():
- Shrink: buffer = x @ lora_a.T
- Expand: y += buffer @ lora_b.T
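As a rough illustration of the shrink and expand steps, here is a minimal sketch in plain PyTorch. The function name `lora_shrink_expand` and the tensor shapes (lora_a as `(rank, in_features)`, lora_b as `(out_features, rank)`) are assumptions made for this example, not taken from the vLLM or vllm_omni source.

```python
import torch


def lora_shrink_expand(x, lora_a, lora_b, y):
    """Hypothetical helper: add a single LoRA delta to y in place.

    Assumed shapes (for illustration only):
      x:      (num_tokens, in_features)
      lora_a: (rank, in_features)
      lora_b: (out_features, rank)
      y:      (num_tokens, out_features)
    """
    buffer = x @ lora_a.T   # shrink: project the input down to the LoRA rank
    y += buffer @ lora_b.T  # expand: project back up and accumulate into y
    return y


torch.manual_seed(0)
x = torch.randn(4, 16)
lora_a = torch.randn(8, 16)
lora_b = torch.randn(32, 8)
y = torch.zeros(4, 32)
out = lora_shrink_expand(x, lora_a, lora_b, y)
```

Because `y` is updated in place, the caller would normally pass in the base layer's output so that the LoRA delta is added on top of it.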
All other functionality (weight management, TP slicing, forward logic) is inherited from vLLM's BaseLinearLayerWithLoRA.
apply
Override: use a simple matmul instead of punica_wrapper.add_lora_linear().
This matches the exact computation in PunicaWrapperGPU.add_lora_linear() for the single-LoRA case. For packed projections (e.g. fused QKV), we apply LoRA per-slice using output_slices.
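The per-slice handling for packed projections might look roughly like the sketch below. It assumes `output_slices` is a tuple of output widths, one per packed sub-projection, and that there is one `(lora_a, lora_b)` pair per slice; the function name `apply_lora_per_slice` and the shapes are hypothetical and chosen only for illustration.

```python
import torch


def apply_lora_per_slice(x, lora_a_list, lora_b_list, output_slices, y):
    """Hypothetical helper: apply one LoRA pair per output slice of y.

    Assumed shapes (for illustration only):
      x:              (num_tokens, in_features)
      lora_a_list[i]: (rank, in_features)
      lora_b_list[i]: (output_slices[i], rank)
      y:              (num_tokens, sum(output_slices))
    """
    offset = 0
    for lora_a, lora_b, width in zip(lora_a_list, lora_b_list, output_slices):
        buffer = x @ lora_a.T                               # shrink
        y[:, offset:offset + width] += buffer @ lora_b.T    # expand into this slice
        offset += width
    return y


torch.manual_seed(0)
output_slices = (32, 16, 16)  # e.g. widths of Q, K, V in a fused QKV layer
x = torch.randn(4, 24)
lora_a_list = [torch.randn(8, 24) for _ in output_slices]
lora_b_list = [torch.randn(width, 8) for width in output_slices]
y = torch.zeros(4, sum(output_slices))
out = apply_lora_per_slice(x, lora_a_list, lora_b_list, output_slices, y)
```

Each slice of `y` receives only the delta from its own LoRA pair, so the sub-projections stay independent even though they share one fused output tensor.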