vllm.v1.worker.gpu.model_states.mm_pruning
Classes:
- MultiModalPruner – Recomputes M-RoPE positions for multimodal models that prune embeddings.
Functions:
- maybe_create_mm_pruner – Create a MultiModalPruner if the model prunes embeddings and uses M-RoPE.
MultiModalPruner
Recomputes M-RoPE positions for multimodal models that prune embeddings (e.g. Qwen2.5-VL / Qwen3-VL / Nemotron-Nano-VL Efficient Video Sampling).
Pruning models append their mrope-position channels to the (variable-count) media embeddings from embed_multimodal. Those channels must be split off and used to recompute mrope positions before the embeddings are merged.
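The layout described above can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors. The helper name `split_position_channels` and the channel count of 3 are assumptions made for illustration and are not taken from vLLM.

```python
def split_position_channels(embeds, num_pos_channels):
    """Split each row into (hidden features, trailing position channels).

    `embeds` is a list of rows, one per media token. Each row is assumed to
    hold the hidden features followed by `num_pos_channels` position values.
    """
    if num_pos_channels <= 0:
        raise ValueError("num_pos_channels must be positive")
    hidden = [row[:-num_pos_channels] for row in embeds]
    positions = [row[-num_pos_channels:] for row in embeds]
    return hidden, positions


# Two media tokens with hidden size 4 and 3 hypothetical position channels.
embeds = [
    [0.1, 0.2, 0.3, 0.4, 7, 0, 0],
    [0.5, 0.6, 0.7, 0.8, 7, 0, 1],
]
hidden, positions = split_position_channels(embeds, num_pos_channels=3)
```

With a 2-D tensor, the same split is a pair of slices along the last dimension.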
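Methods:
- recompute – Target forward: split the appended mrope-position channels off each request's media embeddings, recompute the corrected mrope positions, and stage them back into RopeState.
- strip – Draft forward: strip the appended position channels only.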
Source code in vllm/v1/worker/gpu/model_states/mm_pruning.py
_num_window_embeds(req_id, query_start, query_end)
Count the media items contributing embeddings to [query_start, query_end), mirroring EncoderRunner.gather_mm_embeddings' per-request windowing so the flat mm_embeds list can be re-segmented per request.
Note: This logic is intentionally duplicated here rather than being emitted from gather_mm_embeddings, to keep the main path cleaner, since this is a niche feature.
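The windowing idea can be sketched as an interval-overlap count. This is not the vLLM implementation: `MediaSpan` and `count_window_items` are hypothetical names, and the sketch assumes each media item occupies a contiguous token range and that items are sorted by offset.

```python
from typing import NamedTuple


class MediaSpan(NamedTuple):
    """Hypothetical stand-in for a media item's token range in the prompt."""
    offset: int  # index of the first token the item occupies
    length: int  # number of tokens the item occupies


def count_window_items(spans, query_start, query_end):
    """Count items whose token range overlaps [query_start, query_end)."""
    count = 0
    for span in spans:
        if span.offset >= query_end:
            break  # later items start at or after the window end (sorted input)
        if span.offset + span.length <= query_start:
            continue  # item ends before the window begins
        count += 1
    return count


spans = [MediaSpan(0, 4), MediaSpan(10, 6), MediaSpan(20, 5)]
in_middle = count_window_items(spans, 12, 20)
in_prefix = count_window_items(spans, 0, 11)
in_gap = count_window_items(spans, 4, 10)
```

An item that only partly overlaps the window still counts once, which is what matters when re-segmenting a flat list of per-item embeddings.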
recompute(mm_embeds, input_batch, req_states)
Target forward: split the appended mrope-position channels off each request's media embeddings, recompute the corrected mrope positions, and stage them back into RopeState. Returns the cleaned, flattened embeddings.
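Re-segmenting a flat embedding list per request can be sketched as follows, assuming the number of media items per request is already known. `segment_by_counts` is a hypothetical helper used only for illustration. It does not show how positions are recomputed or written into RopeState.

```python
def segment_by_counts(flat_items, counts):
    """Slice a flat list into consecutive per-request segments.

    `counts[i]` is the number of items belonging to request i, in batch order.
    """
    segments = []
    start = 0
    for n in counts:
        segments.append(flat_items[start:start + n])
        start += n
    if start != len(flat_items):
        raise ValueError("counts do not cover the flat list exactly")
    return segments


# Three media embeddings across three requests; the second request has none.
per_request = segment_by_counts(["emb_a", "emb_b", "emb_c"], [2, 0, 1])
```

The length check guards against the counting logic drifting out of sync with whatever produced the flat list.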
strip(mm_embeds)
Draft forward: strip the appended position channels only.
Stripping is per-embedding, so no per-request segmentation is needed. The speculator reuses the target's already-recomputed positions, hence there is no position write-back here.
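A minimal sketch of why no segmentation is needed: each embedding is cleaned independently of the request it belongs to. The function name and the channel count are assumptions, and nested lists stand in for tensors.

```python
def strip_position_channels(mm_embeds, num_pos_channels):
    """Drop the trailing position channels from every embedding in the list."""
    if num_pos_channels <= 0:
        raise ValueError("num_pos_channels must be positive")
    return [
        [row[:-num_pos_channels] for row in embed]
        for embed in mm_embeds
    ]


mm_embeds = [
    [[1.0, 2.0, 9, 9]],                    # media item with one token
    [[3.0, 4.0, 9, 9], [5.0, 6.0, 9, 9]],  # media item with two tokens
]
cleaned = strip_position_channels(mm_embeds, num_pos_channels=2)
```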
maybe_create_mm_pruner(model_config, model, rope_state, encoder_cache)
Create a MultiModalPruner if the model prunes embeddings and uses M-RoPE.
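The conditional-factory pattern implied by the name can be sketched as below. Everything here is a stand-in: `ToyPruner`, `maybe_create_toy_pruner`, and the two boolean flags are hypothetical. Returning `None` when the conditions are not met is an assumption based on the `maybe_` prefix, not something this page states.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyPruner:
    """Hypothetical stand-in for the pruner; only holds its dependencies."""
    rope_state: Any
    encoder_cache: Any


def maybe_create_toy_pruner(
    prunes_embeddings: bool,
    uses_mrope: bool,
    rope_state: Any,
    encoder_cache: Any,
) -> Optional[ToyPruner]:
    """Build a pruner only when both conditions hold (assumed behavior)."""
    if not (prunes_embeddings and uses_mrope):
        return None
    return ToyPruner(rope_state=rope_state, encoder_cache=encoder_cache)


pruner = maybe_create_toy_pruner(True, True, rope_state={}, encoder_cache={})
no_pruner = maybe_create_toy_pruner(True, False, rope_state={}, encoder_cache={})
```

Under this assumption, callers can branch on `pruner is not None` and leave the regular path untouched for models that do not prune.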