| VLLM_METAL_MEMORY_FRACTION | auto | auto allocates just enough memory plus a minimal KV cache; alternatively, set a decimal fraction of memory to use (for example, 0.5) |
| VLLM_METAL_USE_MLX | 1 | Use MLX for compute (1=yes, 0=no) |
| VLLM_MLX_DEVICE | gpu | MLX device (gpu or cpu) |
| VLLM_METAL_USE_PAGED_ATTENTION | 1 | Enable the experimental paged KV cache |
| VLLM_METAL_DEBUG | 0 | Enable debug logging |
| VLLM_METAL_MULTIMODAL_MODE | auto | Multimodal serve mode: auto and text-only-compat use the compatibility allowlist; multimodal-native disables overrides |
| VLLM_USE_MODELSCOPE | False | Set to True to use https://www.modelscope.cn/ as the model registry |
| VLLM_METAL_MODELSCOPE_CACHE | None | Absolute path of the local model |
| VLLM_METAL_GDN_LAZY_KERNELS | 1 | Enable lazy GDN kernels for eligible hybrid batches. Set to 0 to force the eager conv / C++ recurrent fallback path. |
| VLLM_METAL_MLA_KERNEL | 0 | Enable the experimental absorbed-MLA single-pass Metal decode kernel (RFC #360). Off by default, in which case the MLA wrapper falls back to the MLX SDPA per-request slow path. Set to 1 to route absorbed-MLA decode through the kernel when the workload matches the instantiated specialization (kv_lora_rank=512, qk_rope_head_dim=64, block_size ∈ {16, 32}, fp16/bf16, decode-only). |
| VLLM_METAL_VISIBLE_DEVICES | n/a | Set automatically by the Ray executor for each worker (the device-control variable); not user-configurable. See Distributed. |
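
To illustrate how the defaults above interact with values set in the environment, here is a minimal Python sketch. It is not the project's actual parsing code: `read_settings` and the keys of the returned dictionary are hypothetical names, and the accepted range for the memory fraction (greater than 0, at most 1) is an assumption rather than documented behavior.

```python
import os

# Defaults copied from the table above. Environment variables are always
# strings, so the defaults are stored as strings too.
DEFAULTS = {
    "VLLM_METAL_MEMORY_FRACTION": "auto",
    "VLLM_METAL_USE_MLX": "1",
    "VLLM_MLX_DEVICE": "gpu",
    "VLLM_METAL_USE_PAGED_ATTENTION": "1",
    "VLLM_METAL_DEBUG": "0",
    "VLLM_METAL_MLA_KERNEL": "0",
}


def read_settings(environ=None):
    """Hypothetical helper: overlay the environment on the documented defaults."""
    env = os.environ if environ is None else environ
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}

    # "auto" is passed through unchanged; anything else must parse as a float.
    fraction = raw["VLLM_METAL_MEMORY_FRACTION"]
    if fraction != "auto":
        fraction = float(fraction)  # raises ValueError for non-numeric input
        # Assumption: a usable fraction lies in the interval (0, 1].
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"memory fraction out of range: {fraction}")

    # The table lists gpu and cpu as the only MLX devices.
    device = raw["VLLM_MLX_DEVICE"]
    if device not in ("gpu", "cpu"):
        raise ValueError(f"unknown MLX device: {device!r}")

    return {
        "memory_fraction": fraction,
        "device": device,
        "use_mlx": raw["VLLM_METAL_USE_MLX"] == "1",
        "paged_attention": raw["VLLM_METAL_USE_PAGED_ATTENTION"] == "1",
        "debug": raw["VLLM_METAL_DEBUG"] == "1",
        "mla_kernel": raw["VLLM_METAL_MLA_KERNEL"] == "1",
    }
```

Calling `read_settings()` with no argument reads the real process environment; passing a plain dictionary, such as `read_settings({"VLLM_METAL_MEMORY_FRACTION": "0.5"})`, makes it easy to check a configuration before launching the server.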