Contributing to vLLM Metal
Thanks for your interest in contributing! This plugin targets Apple Silicon Macs only — you'll need an M-series Mac running macOS to build, test, and run it.
Development setup
git clone https://github.com/vllm-project/vllm-metal.git
cd vllm-metal
# Creates ./.venv-vllm-metal/, installs vLLM core + the plugin, and
# prebuilds the native Metal kernels from your checkout
./install.sh
# Activate the virtualenv
source .venv-vllm-metal/bin/activate
# Install dev dependencies (pytest, ruff, mypy, ...)
pip install -e ".[dev]"
Editing the Metal kernels
Release wheels ship the native paged-attention extension and its Metal shader
libraries prebuilt, so end users never compile them. To edit the kernels (the
.metal shaders or paged_ops.cpp) and run from your local source, set the
VLLM_METAL_BUILD_FROM_SOURCE environment variable.
In this mode the C++ extension is recompiled when its inputs change (the build
is hash-checked, so unchanged sources are skipped) and the shaders are compiled
in-process by MLX from the .metal source at runtime. There is no manual
.metallib rebuild step: edit a kernel, restart the Python process, and the
change is picked up.
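As a minimal sketch, the variable can be exported in the shell that launches Python. The value `1` is an assumption, chosen to match the style of the other `VLLM_METAL_*` variables in this guide; the guide itself only names the variable.

```shell
# Assumption: a value of 1 enables build-from-source mode
# (this guide names the variable but does not state the expected value)
export VLLM_METAL_BUILD_FROM_SOURCE=1

# Confirm the variable is exported, so child processes such as Python inherit it
env | grep '^VLLM_METAL_BUILD_FROM_SOURCE='
```

Exporting it once per terminal session is enough; restart the Python process after each kernel edit so the change is picked up.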
Requirements:
- Xcode Command Line Tools (xcode-select --install): clang++ rebuilds the .so; keep it current enough for the pinned MLX headers.
- No Metal toolchain needed: MLX compiles the .metal shaders in-process.
Without VLLM_METAL_BUILD_FROM_SOURCE, the prebuilt artifacts are loaded as-is.
If you edited a kernel source after building them locally, loading fails
loudly on the stale-hash mismatch rather than silently running the old kernel —
set the variable, or rerun python -m vllm_metal.metal.build to refresh the
prebuilt artifacts. (A plain wheel install ships no hash stamps, so end users
never hit this.)
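To illustrate the idea behind a stale-hash check, here is a conceptual sketch. It is not the plugin's implementation: the file names are hypothetical, and `cksum` stands in for whatever hash the real build records.

```shell
# Conceptual sketch only: cksum stands in for the hash used by the real build
dir="$(mktemp -d)"
src="$dir/kernel.metal"      # hypothetical kernel source
stamp="$dir/kernel.stamp"    # hypothetical hash stamp written at build time

echo "kernel v1" > "$src"
cksum < "$src" > "$stamp"    # "build": record the hash of the source

check_fresh() {
  # Succeeds only if the source still matches the recorded stamp
  [ "$(cksum < "$src")" = "$(cat "$stamp")" ]
}

check_fresh && first="fresh" || first="stale"
echo "kernel v2" > "$src"    # edit the source without rebuilding
check_fresh && second="fresh" || second="stale"

echo "$first $second"        # prints "fresh stale"
```

Failing on the mismatch, rather than loading the old artifact, is what keeps an edited kernel from being silently ignored.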
Run lint locally
Mirrors the lint job in CI (ruff, ruff format --check, mypy, shellcheck):
Run CI locally
Mirrors the test job in CI — serving smoke tests plus the non-slow pytest suite:
For a faster inner loop while iterating, run pytest directly:
🎉 Congratulations! You have completed the development environment setup.
Before you open the PR
Two conditional checks apply depending on what your PR touches:
If your PR adds or modifies a model, include a deterministic test that asserts the generated tokens match the mlx_lm reference under greedy sampling (temperature=0). See tools/gen_golden_token_ids_for_deterministics.py for how to generate golden token IDs for a new model.
If your PR claims a performance improvement, attach before/after benchmark results. For example, using vllm bench serve with the sonnet dataset:
curl -O https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/sonnet.txt
# 1. Start the server
VLLM_METAL_USE_PAGED_ATTENTION=1 VLLM_METAL_MEMORY_FRACTION=0.8 \
vllm serve Qwen/Qwen3-0.6B --port 8000 --max-model-len 2048
# 2. Run the benchmark
vllm bench serve \
--backend openai \
--base-url http://localhost:8000 \
--model Qwen/Qwen3-0.6B \
--dataset-name sonnet \
--dataset-path sonnet.txt \
--num-prompts 100 \
--request-rate inf \
--percentile-metrics ttft,tpot,e2el \
--metric-percentiles 50,99
Developer Certificate of Origin (DCO)
When contributing changes to this project, you must agree to the DCO. Every commit must include a Signed-off-by: line, which certifies agreement with the terms of the DCO.
Running git commit with -s adds this line automatically.
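To see what `-s` does, the following sketch makes a signed-off commit in a throwaway repository and prints the line git appended. The name, email, and file name are placeholders, not values from this project.

```shell
# Throwaway repository so nothing touches your real checkout
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity (hypothetical); git builds the sign-off from these values
git config user.name "Ada Example"
git config user.email "ada@example.com"

echo "hello" > notes.txt
git add notes.txt

# -s appends a Signed-off-by line built from user.name and user.email
git commit -q -s -m "Add notes"

# Extract the sign-off line from the commit message
signoff="$(git log -1 --format=%B | grep '^Signed-off-by:')"
echo "$signoff"   # prints "Signed-off-by: Ada Example <ada@example.com>"
```

Because the sign-off is taken from your git identity, make sure `user.name` and `user.email` are set to the values you want recorded before committing.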
Submit your changes
- Fork the repository on GitHub.
- Re-point origin to your fork and add upstream:
git remote set-url origin https://github.com/<your-username>/vllm-metal.git
git remote add upstream https://github.com/vllm-project/vllm-metal.git
- Create a feature branch.
- Commit your changes using -s (adds the DCO sign-off automatically).
- Push to your fork.
- Open a pull request against main in the upstream repository.
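The branch, commit, and push steps above can be rehearsed offline before touching GitHub. In this sketch a local bare repository stands in for your fork; the branch name `my-feature`, the identity, the file, and the commit messages are all placeholders.

```shell
# Local stand-ins so the sketch runs offline: a bare repo plays the role of your fork
work="$(mktemp -d)"
git init -q --bare "$work/fork.git"
git init -q "$work/checkout"
cd "$work/checkout"
git config user.name "Ada Example"        # placeholder identity
git config user.email "ada@example.com"
git remote add origin "$work/fork.git"
git commit -q --allow-empty -m "Initial commit"

# Create a feature branch (the name is a placeholder)
git checkout -q -b my-feature

# Commit your changes with -s so the DCO sign-off is added
echo "change" > change.txt
git add change.txt
git commit -q -s -m "Describe the change"

# Push the branch to the fork and set it as the upstream of the local branch
git push -q -u origin my-feature
```

Against a real fork only the three commands under the last three comments are needed; the rest exists so the example is self-contained.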