vllm.model_executor.models.config ¶
Classes:
- ColQwen3_5Config – ColQwen3.5 (late-interaction retrieval) inherits Qwen3.5's mamba cache handling and additionally serves bidirectional attention.
- DiffusionGemmaModelForBlockDiffusionConfig
- Gemma4Config
- HybridAttentionMambaModelConfig
- LlamaNemotronVLConfig – Config handler for LlamaNemotronVL embedding models.
- MambaModelConfig
- NemotronHForCausalLMConfig
- Qwen3_5ForConditionalGenerationConfig
- UnlimitedOCRForCausalLMConfig
ColQwen3_5Config ¶
Bases: Qwen3_5ForConditionalGenerationConfig
ColQwen3.5 (late-interaction retrieval) inherits Qwen3.5's mamba cache handling and additionally serves bidirectional attention: ColPali-style document/query encoding attends over the whole sequence, not causally. This handler sets is_causal=False so that Qwen3NextAttention builds its full_attention layers with AttentionType.ENCODER_ONLY (the linear_attention GatedDeltaNet layers are unaffected). Generation architectures keep the parent's causal behavior and are untouched.
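The effect of the is_causal flag on a hybrid layer stack can be illustrated with a small standalone sketch. This is not vLLM code: `AttnKind` is a hypothetical stand-in for vLLM's AttentionType, and `attention_kind` is an illustrative helper; only the layer-type names `full_attention` and `linear_attention` come from the description above.

```python
from enum import Enum
from typing import Optional


class AttnKind(Enum):
    """Hypothetical stand-in for vLLM's AttentionType (two members only)."""

    DECODER = "decoder"
    ENCODER_ONLY = "encoder_only"


def attention_kind(layer_type: str, is_causal: bool) -> Optional[AttnKind]:
    """Pick the attention kind for one layer; None means 'not an attention layer'."""
    if layer_type == "linear_attention":
        # Linear-attention (GatedDeltaNet) layers are unaffected by is_causal.
        return None
    if layer_type == "full_attention":
        return AttnKind.DECODER if is_causal else AttnKind.ENCODER_ONLY
    raise ValueError(f"unknown layer type: {layer_type!r}")


# Illustrative layer stack; real models define their own layout.
layer_types = ["linear_attention", "linear_attention", "full_attention"]
retrieval = [attention_kind(t, is_causal=False) for t in layer_types]
generation = [attention_kind(t, is_causal=True) for t in layer_types]
```

In this sketch only the full_attention entry changes between the retrieval and generation cases, which mirrors the behavior described for ColQwen3.5 versus its causal parent.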
Source code in vllm/model_executor/models/config.py
DiffusionGemmaModelForBlockDiffusionConfig ¶
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Set up the diffusion config and defaults for DiffusionGemma.
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) classmethod ¶
Set up the diffusion config and defaults for DiffusionGemma.
Auto-creates DiffusionConfig from the HF config when the user didn't pass --diffusion-config. Diffusion sampling params are read straight from generation_config.json at sampler-build time (see DiffusionGemma's custom_sampler), not injected here.
Source code in vllm/model_executor/models/config.py
Gemma4Config ¶
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Configure attention for heterogeneous head dimensions.
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) staticmethod ¶
Configure attention for heterogeneous head dimensions.
Gemma4 uses different head dimensions for sliding window (head_dim) vs full attention (global_head_dim) layers. The default FA3 on Hopper cannot handle head_dim > 256, which causes mixed backend selection and numerical divergence.
When FA4 is available we force it for ALL layers, giving a uniform kernel path and avoiding the mixed FA3+FA4 penalty. When FA4 is not available we fall back to Triton.
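The decision described above can be sketched as a small pure function. This is an illustration only: the function name, the backend labels "FA4" and "TRITON", and the rule that no override happens when every head dimension fits within 256 are assumptions, not vLLM's actual implementation.

```python
from typing import Optional

# Limit quoted in the description above: FA3 cannot handle head_dim > 256.
FA3_MAX_HEAD_DIM = 256


def pick_backend(
    head_dim: int, global_head_dim: int, fa4_available: bool
) -> Optional[str]:
    """Return one backend label for all layers, or None to keep the default."""
    if max(head_dim, global_head_dim) <= FA3_MAX_HEAD_DIM:
        # Assumption: nothing needs overriding when every layer fits FA3.
        return None
    # One backend for ALL layers avoids a mixed kernel path.
    return "FA4" if fa4_available else "TRITON"
```

For example, with hypothetical dimensions head_dim=256 and global_head_dim=512, the sketch picks "FA4" when it is available and "TRITON" otherwise.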
Source code in vllm/model_executor/models/config.py
HybridAttentionMambaModelConfig ¶
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Perform early validation and setup for hybrid attention/mamba models.
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) classmethod ¶
Perform early validation and setup for hybrid attention/mamba models.
Block size alignment with mamba page sizes is handled later by Platform.update_block_size_for_backend(), which runs after model layers are constructed and the attention backend is known.
Parameters:
- vllm_config (VllmConfig) – vLLM Config
Source code in vllm/model_executor/models/config.py
LlamaNemotronVLConfig ¶
Bases: VerifyAndUpdateConfig
Config handler for LlamaNemotronVL embedding models.
Source code in vllm/model_executor/models/config.py
MambaModelConfig ¶
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Enable FULL_AND_PIECEWISE cuda graph mode by default (required to get good performance for mamba layers in V1).
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) classmethod ¶
Enable FULL_AND_PIECEWISE cuda graph mode by default (required to get good performance for mamba layers in V1).
Parameters:
- vllm_config (VllmConfig) – vLLM Config
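"By default" can be read as "only when the user has not chosen a mode". A minimal sketch of that reading follows; the helper name is hypothetical, and representing an unset mode as None is an assumption about how the setting is stored.

```python
from typing import Optional


def default_cudagraph_mode(user_mode: Optional[str]) -> str:
    """Use FULL_AND_PIECEWISE unless the user chose a mode explicitly."""
    # Assumption: None means the user did not set a CUDA graph mode.
    return "FULL_AND_PIECEWISE" if user_mode is None else user_mode
```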
Source code in vllm/model_executor/models/config.py
NemotronHForCausalLMConfig ¶
Bases: VerifyAndUpdateConfig
Methods:
- update_mamba_ssm_cache_dtype – Update mamba_ssm_cache_dtype for NemotronH models when set to 'auto' (or not explicitly set).
Attributes:
- DEFAULT_MAMBA_SSM_CACHE_DTYPE – Only float32 is known to have no accuracy issues by default.
Source code in vllm/model_executor/models/config.py
DEFAULT_MAMBA_SSM_CACHE_DTYPE = 'float32' class-attribute instance-attribute ¶
Only float32 is known to have no accuracy issues by default.
update_mamba_ssm_cache_dtype(*, cache_config, hf_config) classmethod ¶
Update mamba_ssm_cache_dtype for NemotronH models when set to 'auto' (or not explicitly set), to the value specified in the HF config, or to float32 if not specified.
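The fallback order described above (explicit user value, then the HF config, then float32) can be sketched as follows. The function name is hypothetical, the HF config is modeled as a plain mapping, and the key "mamba_ssm_cache_dtype" is an assumed name, since the description does not say which HF field is read.

```python
from typing import Any, Mapping, Optional

DEFAULT_MAMBA_SSM_CACHE_DTYPE = "float32"


def resolve_ssm_cache_dtype(
    requested: Optional[str], hf_config: Mapping[str, Any]
) -> str:
    """Resolve 'auto' (or unset) to the HF config value, else to float32."""
    if requested not in (None, "auto"):
        # An explicit user choice is kept as-is.
        return requested
    # "mamba_ssm_cache_dtype" is a hypothetical key used for illustration.
    return hf_config.get("mamba_ssm_cache_dtype") or DEFAULT_MAMBA_SSM_CACHE_DTYPE
```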
Source code in vllm/model_executor/models/config.py
Qwen3_5ForConditionalGenerationConfig ¶
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Update mamba_ssm_cache_dtype for Qwen3.5 models when set to 'auto' (or not explicitly set).
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) staticmethod ¶
Update mamba_ssm_cache_dtype for Qwen3.5 models when set to 'auto' (or not explicitly set), to the value specified in the HF config's mamba_ssm_dtype field. Warn if the user explicitly overrides it to a different value.
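A rough sketch of this resolve-and-warn behavior is shown below. The function name is hypothetical, and two details are assumptions not stated above: the user's explicit value is kept after the warning, and 'auto' is left unchanged when the HF config has no mamba_ssm_dtype.

```python
import warnings
from typing import Optional


def resolve_qwen_ssm_dtype(
    requested: Optional[str], hf_mamba_ssm_dtype: Optional[str]
) -> str:
    """Adopt the HF mamba_ssm_dtype for 'auto'; warn on a differing override."""
    if requested in (None, "auto"):
        # Assumption: with no HF value there is nothing to adopt, so keep "auto".
        return hf_mamba_ssm_dtype or "auto"
    if hf_mamba_ssm_dtype is not None and requested != hf_mamba_ssm_dtype:
        warnings.warn(
            f"mamba_ssm_cache_dtype={requested!r} overrides the HF config "
            f"value {hf_mamba_ssm_dtype!r}"
        )
    # Assumption: the explicit user value wins even after the warning.
    return requested
```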
Source code in vllm/model_executor/models/config.py
UnlimitedOCRForCausalLMConfig ¶
Bases: VerifyAndUpdateConfig
Methods:
- verify_and_update_config – Configure Unlimited-OCR attention backends for R-SWA and vision.
Source code in vllm/model_executor/models/config.py
verify_and_update_config(vllm_config) staticmethod ¶
Configure Unlimited-OCR attention backends for R-SWA and vision.
Backend selection is controlled by the standard --attention-config CLI argument (priority order):
- --attention-config '{"backend": "FLASH_ATTN"}' → FA4 + rswa_mask_mod. Exact token-level R-SWA. flash_attn_version is forced to 4 if not already set (the R-SWA mask_mod requires FA4; FA3 cannot express it). Raises if FA4 is not available on this device.
- --attention-config '{"backend": "FLEX_ATTENTION"}' → FlexAttention R-SWA via Triton block mask.
- --attention-config '{"backend": "auto"}' (or omitted) → Auto-detect: FA4 if available (H20/H100 SM90), else FlexAttention.
Regardless of backend, prefix caching is disabled for this model: R-SWA decode-phase KV is not a pure causal function of the prefix (so decode blocks are not reusable), and single-turn image-led OCR prompts rarely hit the prefix cache.
Example: force FlexAttention even on a machine with FA4:
vllm serve baidu/Unlimited-OCR \
--attention-config '{"backend": "FLEX_ATTENTION"}'
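The selection rules for this handler can be summarized in a standalone sketch. It is not the vLLM implementation: the function name, the returned dictionary layout, and the exception types are assumptions chosen for illustration; the backend names and the forced FA version come from the description above.

```python
from typing import Any, Dict, Optional


def select_rswa_backend(
    backend: Optional[str],
    fa4_available: bool,
    flash_attn_version: Optional[int] = None,
) -> Dict[str, Any]:
    """Resolve the requested backend into concrete settings for R-SWA."""
    choice = (backend or "auto").upper()
    if choice == "AUTO":
        # Auto-detect: prefer FA4 when the device supports it.
        choice = "FLASH_ATTN" if fa4_available else "FLEX_ATTENTION"
    if choice == "FLASH_ATTN":
        if not fa4_available:
            raise RuntimeError("FLASH_ATTN requested but FA4 is not available")
        # Forced to 4 only if the user did not already set a version.
        if flash_attn_version is None:
            flash_attn_version = 4
    elif choice != "FLEX_ATTENTION":
        raise ValueError(f"unsupported backend: {backend!r}")
    return {
        "backend": choice,
        "flash_attn_version": flash_attn_version,
        # Prefix caching is disabled regardless of backend.
        "enable_prefix_caching": False,
    }
```

Calling `select_rswa_backend("FLEX_ATTENTION", fa4_available=True)` in this sketch keeps FlexAttention even though FA4 is present, matching the command-line example above.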
Source code in vllm/model_executor/models/config.py