vLLM Update Procedure¶

This document describes the complete procedure for upgrading the sendnn-inference plugin to a newly released vLLM version. It covers bumping the pin in pyproject.toml, regenerating uv.lock, running CPU + compat tests, adding compat shims if upstream APIs changed, and extending the CI backwards-compat matrix.

This procedure has produced clean upgrades for past patch releases (0.20.1, 0.20.2, 0.21.0) and surfaced the right compat shims for breaking releases (0.20.0, 0.10.1).

AI Assistant Skills

This procedure is referenced by both Claude Code's /update-vllm skill and IBM Bob's update-vllm skill. When using these assistants, they will automatically follow this procedure and replace {VERSION} with the actual version number you specify.

!!! note Do not attempt to upgrade vLLM on Mac systems. Upstream vLLM forcefully overrides VLLM_TARGET_DEVICE to cpu on Macs which results in an incorrect vLLM resolution and the addition of unneeded dependencies to the generated lockfile.

The companion doc at maintaining.md provides the high-level workflow context. This document is the operational checklist: exact commands, ordered tests, and the compat-shim playbook.

Phase 0 — Preflight¶

Verify the upstream tag exists. If this returns 404, stop — the version is incorrect:

curl -fsS https://api.github.com/repos/vllm-project/vllm/git/refs/tags/v{VERSION}

Detect hardware mode:

echo "AIU_WORLD_SIZE=${AIU_WORLD_SIZE:-<unset>}"

Unset → CPU/eager mode. All test invocations must use -m "cpu" (and source _local_envs_for_test.sh). Spyre-marked tests are deferred to the post-PR bot:test validation suite.
Set → hardware available. After CPU tests pass, optionally also run -m "spyre" markers.
Branch from main if not already on a vllm-{VERSION} branch:

git checkout -b vllm-{VERSION} main

Read the upstream release notes and skim for breaking-change candidates:

gh api repos/vllm-project/vllm/releases/tags/v{VERSION} --jq '.body' | head -200

High-impact areas to flag for this plugin: - Worker / model runner V2 (vllm.v1.worker.worker_base.WorkerBase, compile_or_warm_up_model return type, KVCacheConfig, KVCacheSpec) - Scheduler / engine core (SchedulerOutput, CachedRequestData, NewRequestData, MRv2 changes, KV connectors) - Platform interface (vllm.platforms.Platform — new abstract methods, removed methods) - Sampler / output dataclasses (ModelRunnerOutput, SamplerOutput) - Tokenizer registry (this plugin patches it in platform.py) - Multimodal processor APIs (LlavaNext, Mistral3 → sendnn_inference/multimodal/mm_mappings/) - Configuration (ModelConfig, SchedulerConfig, VllmConfig — new/removed fields) - Removed / deprecated APIs — for each, grep this repo to see if it's still referenced.

Phase 1 — Bump the pin¶

Edit pyproject.toml:

[project.dependencies]: bump the upper bound to <X.Y.(Z+1) (e.g., 0.22.0 → <0.22.1). Keep the existing lower bound unless explicitly raising it.
[tool.uv.sources.vllm].rev: set to vX.Y.Z.
Audit [tool.uv].override-dependencies:

Re-fetch upstream requirements/common.txt and check whether each override is still load-bearing:

curl -fsS https://raw.githubusercontent.com/vllm-project/vllm/v{VERSION}/requirements/common.txt | grep -E "llguidance|torch|transformers"

If vLLM bumped torch, update the torch==X.Y.Z override to match. (vLLM 0.20.0 bumped 2.10→2.11, for example.)
If a forced floor (e.g., llguidance>=1.7.3) is now within vLLM's own range, the override may be relaxable — but only remove it if the original reason (CVE, arch fix) is genuinely resolved upstream.

Then regenerate the lockfile and reinstall:

source .venv/bin/activate    # if not already active
uv sync --active --inexact
bash format.sh

Phase 2 — Run CPU + compat tests¶

Always source the env file first; without it MASTER_ADDR/MASTER_PORT/SENDNN_INFERENCE_DYNAMO_BACKEND=eager are unset and tests will fail or use the wrong backend:

source _local_envs_for_test.sh

Run the suites in this order (cheapest → most expensive). Stop and fix on the first failure:

#	Command	What it covers
1	`pytest -m "cpu and basic and not quantized" --timeout=300 -x`	Smoke — fastest signal
2	`pytest -m "compat or (cpu and basic and not quantized)" --timeout=300`	Existing compat shims still work
3	`pytest -m "cpu and decoder and not quantized and not multimodal" --timeout=300`	Chunked prefill + prefix caching
4	`pytest -m "cpu and embedding and not quantized" --timeout=300`	Embedding
5	`pytest -m "cpu and scoring" --timeout=300`	Scoring
6	`pytest -m "not e2e and not quantized and not spyre and not multimodal" --timeout=300`	Worker + utils
7	`HF_HUB_OFFLINE=0 pytest -m "cpu and multimodal" --timeout=600`	Multimodal (needs HF online)

If everything passes here with no source changes, this is a clean upgrade — skip Phase 3.

Phase 3 — Compat shims (only when tests fail)¶

Use sendnn_inference/compat_utils.py helpers (has_argument, dataclass_fields) plus the patterns below. For every shim added, add a corresponding test in tests/utils/test_upstream_compatibility.py that fails when VLLM_VERSION == "vLLM:lowest" no longer needs it. The existing test_compilation_times_compat is the template.

Symptom	Fix
`ImportError` for moved/removed symbol	`try: from new import X; except ImportError: X = None` (or `from old import X` in the except branch). Branch on `X is None` at call sites. See `spyre_worker.py` `CompilationTimes` shim.
Method gained or lost a kwarg	`kwargs = {"k": v} if has_argument(func, "k") else {}; func(..., **kwargs)`
Dataclass added/removed a field	`if "field" in dataclass_fields(Cls): ...`
New abstract method on `Platform` / `WorkerBase`	Implement a stub on the Spyre subclass. See `manual_seed_all` added in PR #952.
Method signature changed return type	Wrap call site to coerce: `t = old_func(); return New(t) if New is not None else t`
Old `hasattr` guard now redundant	Delete the guard, simplify the call site, and remove the matching compat test.

When in doubt, check past upgrade PRs for precedent: gh pr list --search "vllm" --state merged --limit 30. PR #952 (0.20.0) is the richest example.

Phase 4 — Extend the CI matrix¶

In .github/workflows/test.yml, add a new entry under include: for every released version between vLLM:lowest and the new target version that does not already have an entry. The goal is that every supported version (i.e. every version satisfying the >=lower,<upper constraint in pyproject.toml) has a backwards-compat test job. Mirror the format of the existing intermediate entries — only name and repo change. Keep vLLM:lowest, vLLM:main, and any existing intermediates unchanged.

Phase 5 — Verify and audit¶

Re-run bash format.sh. (ruff, typos, ty, actionlint should all pass.)
Audit git diff --stat uv.lock:
Tiny (~10 lines) → patch bump, only the vLLM rev moved. Expected.
Huge (~3000+ lines) → vLLM minor release reshuffled transitives. Expected.
Medium (50–500 lines) → look closely. Make sure unrelated packages didn't drift in surprising ways.
Verify only expected files changed:

git status
# Expected: pyproject.toml, uv.lock, .github/workflows/test.yml,
#           and (if compat shims added) sendnn_inference/**, tests/utils/test_upstream_compatibility.py

Phase 6 — Open the PR¶

Title: chore: bumps vllm to v{VERSION}

Body skeleton:

## Description
- Bumps vllm support to v{VERSION} (release notes: https://github.com/vllm-project/vllm/releases/tag/v{VERSION})
- <List each compat shim added/removed and why, or "no source changes required">
- <List override-dependency changes, if any>

## Test Plan
- CPU + compat suites pass locally
- Hardware tests pending `bot:test` (run after PR opens)

## Checklist
- [x] I have read the contributing guidelines
- [x] My code follows the project's code style (run `bash format.sh`)
- [ ] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)
- [x] My commits include a `Signed-off-by:` line (DCO compliance)

After the PR is open, request hardware validation by commenting:

bot:test
MARKERS="spyre and not quantized"

For dependency-touching changes, comprehensive spyre-marked tests should be run before merge.

Scope guardrails¶

Do not bump the lower bound during a vLLM upgrade. Lower-bound bumps are a separate task (see maintaining.md → "Removing support for a vLLM version") because they require deleting compat code and pruning the test matrix.
Do not "improve" code outside the compat surface. A vLLM upgrade PR should be reviewable as one focused change.
If a transitive dep moves in a surprising way (uv sync output shows e.g. transformers/torch shifting unexpectedly), surface it to the user before continuing — don't silently accept.