Enable lazy GDN kernels for eligible hybrid batches. Set to 0 to force the eager conv / C++ recurrent fallback path.
VLLM_METAL_MLA_KERNEL
0
Enable the experimental absorbed-MLA single-pass Metal decode kernel (RFC #360). With the default of 0, the MLA wrapper uses the MLX SDPA per-request slow path. Set to 1 to route absorbed-MLA decode through the kernel when the workload matches the instantiated specialization (kv_lora_rank=512, qk_rope_head_dim=64, block_size ∈ {16, 32}, fp16/bf16, decode-only).
VLLM_METAL_VISIBLE_DEVICES
—
Set automatically by the Ray executor per worker (the device-control var); not user-configurable. See Distributed.
VLLM_METAL_RING_BASE_PORT
32323
Base TCP port for the MLX ring data plane under pipeline parallelism; stage r binds base + r (so two stages use 32323 and 32324 by default). Set the same value on every node to move the ring off a busy port, for example when an mlx.launch job, a socket from a recent restart still in TIME_WAIT, or another PP job holds the default. See Distributed.
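The base + r port rule for VLLM_METAL_RING_BASE_PORT can be sketched as below, which may help when checking which ports need to be free on each node. The helper name ring_ports is hypothetical and not part of vllm-metal; it only mirrors the arithmetic described in the table, using the documented default of 32323.

```python
import os

# Documented default for VLLM_METAL_RING_BASE_PORT.
DEFAULT_RING_BASE_PORT = 32323


def ring_ports(num_stages, env=None):
    """Return the port each pipeline stage would bind: base + r for stage r.

    Hypothetical helper for illustration only; it is not a vllm-metal API.
    """
    env = os.environ if env is None else env
    base = int(env.get("VLLM_METAL_RING_BASE_PORT", DEFAULT_RING_BASE_PORT))
    return [base + r for r in range(num_stages)]


# Two stages with the variable unset, then three stages with an override.
print(ring_ports(2, env={}))
print(ring_ports(3, env={"VLLM_METAL_RING_BASE_PORT": "40000"}))
```

Because every stage derives its port from the same base, the override only helps if all nodes see the same value.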
auto: use the text-only compatibility path for checkpoints on the compatibility allowlist, such as Gemma4 and Qwen3.5/Qwen3.6 FP8 conditional-generation wrappers.
text-only-compat: use the same compatibility allowlist as auto.
multimodal-native: disable the compatibility fallback and keep the native multimodal path active when validating or developing real multimodal support.
Pass --speculative-config with a JSON object to enable speculative decoding.
Use --no-async-scheduling (required for all spec-decode methods on Metal).
See Speculative Decoding for supported methods, model pairing, and memory considerations.
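As a rough illustration of the two flags above, the following sketch assembles a command line that passes a JSON object to --speculative-config and adds --no-async-scheduling. The keys in the config (method, num_speculative_tokens, prompt_lookup_max) follow upstream vLLM's n-gram method and are an assumption here; whether that method is supported on Metal should be confirmed on the Speculative Decoding page. The vllm serve entry point, the helper name build_serve_args, and the model name are likewise placeholders.

```python
import json
import shlex


def build_serve_args(model, speculative_config):
    """Build an argv list with spec decoding enabled and async scheduling off.

    Hypothetical helper; 'vllm serve' as the entry point is an assumption.
    """
    return [
        "vllm", "serve", model,
        # The flag takes a single JSON object as its value.
        "--speculative-config", json.dumps(speculative_config),
        # Required for all spec-decode methods on Metal, per the note above.
        "--no-async-scheduling",
    ]


# Assumed config keys, borrowed from upstream vLLM's n-gram method.
spec = {"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}
args = build_serve_args("your-org/your-model", spec)

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(args))
```

Serializing with json.dumps and quoting with shlex.join avoids hand-escaping the JSON braces and double quotes in the shell.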