Distributed Inference with Ray¶
Warning
This is a contributor-facing feature (scaffolding for upcoming multi-Mac support), not intended for general users — for normal single-Mac serving, use the default in-process executor.
vllm-metal can run under vLLM's Ray distributed executor, placing each Apple-Silicon worker as a Ray actor. This is the groundwork for multi-Mac serving; today the single-node path (one Mac, --tensor-parallel-size 1) is supported and validated. Multi-node tensor / pipeline parallelism is in progress (see Limitations).
Apple GPUs are not a Ray-recognized accelerator type (unlike CUDA or TPU), so each node advertises a custom Ray resource named mlx, and vLLM's executor places one worker per mlx unit.
Requirements¶
rayinstalled alongside vllm-metal:uv pip install ray(orpip install ray).- Each node started with one
mlxresource (one Apple GPU per Mac). - The Ray node IP must match the address vLLM resolves via
get_ip()— do not mix loopback and LAN addresses.
Quick start (single node)¶
# 1. Start a Ray head node advertising the Apple GPU as the "mlx" resource.
# Pin the node IP to the address vLLM uses so placement-group binding matches.
IP=$(python -c "from vllm.utils.network_utils import get_ip; print(get_ip())")
ray start --head --node-ip-address="$IP" --resources='{"mlx": 1}'
# 2. Serve through the Ray executor (connects to the running cluster).
# Plain `--distributed-executor-backend ray` uses the default Ray V2 executor.
RAY_ADDRESS=auto vllm serve Qwen/Qwen3-0.6B \
--distributed-executor-backend ray \
--tensor-parallel-size 1
# 3. Verify generation runs through Ray.
curl -s localhost:8000/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"Qwen/Qwen3-0.6B","prompt":"The capital of France is","max_tokens":16,"temperature":0}'
# → " Paris. The capital of Italy is Rome. The capital of Spain is Madrid."
# Tear down when done.
ray stop
If a node isn't advertising the mlx resource, the engine can't place workers: the Ray executor logs No available node types can fulfill resource request {'mlx': 1.0, ...} and hangs while creating the placement group. (A separate current platform cpu does not support ray error instead means the Metal plugin isn't active or ray_device_key is unset — not a missing resource.)
On a healthy boot, each worker logs vllm_metal: patched Ray V2 worker get_node_and_gpu_ids on RayWorkerProc (Apple-GPU custom Ray resource) — confirming the custom-resource override fired inside the Ray actor.
How it works¶
MetalPlatformsetsray_device_key = "mlx"anddevice_control_env_var = "VLLM_METAL_VISIBLE_DEVICES", so vLLM's Ray executor takes its generic custom-resource placement path (the same one TPU uses) instead of the CUDAnum_gpuspath.- A compatibility shim overrides the Ray worker's
get_node_and_gpu_idsto read the assignedmlxresource — Ray never lists custom resources inget_accelerator_ids()— installed in each worker via a Rayworker_process_setup_hook.
Limitations¶
- Single node,
--tensor-parallel-size 1is what is validated today. Multi-node tensor / pipeline parallelism — where the cross-Mac collectives run through MLX (mx.distributed) — is not yet implemented. - Pipeline parallelism is rejected:
--pipeline-parallel-size > 1raises at startup until cross-stage activation hand-off lands. - One
mlxresource per Mac (a single Apple GPU per machine). - On a multi-NIC host, set the Ray node IP and vLLM's
get_ip()consistently (e.g.ray start --node-ip-address=...).