vLLM Update Procedure¶
This document describes the complete procedure for upgrading the sendnn-inference plugin to a newly released vLLM version. It covers bumping the pin in pyproject.toml, regenerating uv.lock, running CPU + compat tests, adding compat shims if upstream APIs changed, and extending the CI backwards-compat matrix.
This procedure has produced clean upgrades for past patch releases (0.20.1, 0.20.2, 0.21.0) and surfaced the right compat shims for breaking releases (0.20.0, 0.10.1).
AI Assistant Skills
This procedure is referenced by both Claude Code's /update-vllm skill and IBM Bob's update-vllm skill. When using these assistants, they will automatically follow this procedure and replace {VERSION} with the actual version number you specify.
The companion doc at maintaining.md provides the high-level workflow context. This document is the operational checklist: exact commands, ordered tests, and the compat-shim playbook.
Phase 0 — Preflight¶
- Verify the upstream tag exists. If this returns 404, stop — the version is incorrect:
- Detect hardware mode:
- Unset → CPU/eager mode. All test invocations must use
-m "cpu"(and source_local_envs_for_test.sh). Spyre-marked tests are deferred to the post-PRbot:testvalidation suite. -
Set → hardware available. After CPU tests pass, optionally also run
-m "spyre"markers. -
Branch from
mainif not already on avllm-{VERSION}branch:
- Read the upstream release notes and skim for breaking-change candidates:
High-impact areas to flag for this plugin:
- Worker / model runner V2 (vllm.v1.worker.worker_base.WorkerBase, compile_or_warm_up_model return type, KVCacheConfig, KVCacheSpec)
- Scheduler / engine core (SchedulerOutput, CachedRequestData, NewRequestData, MRv2 changes, KV connectors)
- Platform interface (vllm.platforms.Platform — new abstract methods, removed methods)
- Sampler / output dataclasses (ModelRunnerOutput, SamplerOutput)
- Tokenizer registry (this plugin patches it in platform.py)
- Multimodal processor APIs (LlavaNext, Mistral3 → sendnn_inference/multimodal/mm_mappings/)
- Configuration (ModelConfig, SchedulerConfig, VllmConfig — new/removed fields)
- Removed / deprecated APIs — for each, grep this repo to see if it's still referenced.
Phase 1 — Bump the pin¶
Edit pyproject.toml:
[project.dependencies]: bump the upper bound to<X.Y.(Z+1)(e.g., 0.22.0 →<0.22.1). Keep the existing lower bound unless explicitly raising it.[tool.uv.sources.vllm].rev: set tovX.Y.Z.- Audit
[tool.uv].override-dependencies: -
Re-fetch upstream
requirements/common.txtand check whether each override is still load-bearing: -
If vLLM bumped torch, update the
torch==X.Y.Zoverride to match. (vLLM 0.20.0 bumped 2.10→2.11, for example.) - If a forced floor (e.g.,
llguidance>=1.7.3) is now within vLLM's own range, the override may be relaxable — but only remove it if the original reason (CVE, arch fix) is genuinely resolved upstream.
Then regenerate the lockfile and reinstall:
Phase 2 — Run CPU + compat tests¶
Always source the env file first; without it MASTER_ADDR/MASTER_PORT/SENDNN_INFERENCE_DYNAMO_BACKEND=eager are unset and tests will fail or use the wrong backend:
Run the suites in this order (cheapest → most expensive). Stop and fix on the first failure:
| # | Command | What it covers |
|---|---|---|
| 1 | pytest -m "cpu and basic and not quantized" --timeout=300 -x |
Smoke — fastest signal |
| 2 | pytest -m "compat or (cpu and basic and not quantized)" --timeout=300 |
Existing compat shims still work |
| 3 | pytest -m "cpu and decoder and not quantized and not multimodal" --timeout=300 |
Chunked prefill + prefix caching |
| 4 | pytest -m "cpu and embedding and not quantized" --timeout=300 |
Embedding |
| 5 | pytest -m "cpu and scoring" --timeout=300 |
Scoring |
| 6 | pytest -m "not e2e and not quantized and not spyre and not multimodal" --timeout=300 |
Worker + utils |
| 7 | HF_HUB_OFFLINE=0 pytest -m "cpu and multimodal" --timeout=600 |
Multimodal (needs HF online) |
If everything passes here with no source changes, this is a clean upgrade — skip Phase 3.
Phase 3 — Compat shims (only when tests fail)¶
Use sendnn_inference/compat_utils.py helpers (has_argument, dataclass_fields) plus the patterns below. For every shim added, add a corresponding test in tests/utils/test_upstream_compatibility.py that fails when VLLM_VERSION == "vLLM:lowest" no longer needs it. The existing test_compilation_times_compat is the template.
| Symptom | Fix |
|---|---|
ImportError for moved/removed symbol |
try: from new import X; except ImportError: X = None (or from old import X in the except branch). Branch on X is None at call sites. See spyre_worker.py CompilationTimes shim. |
| Method gained or lost a kwarg | kwargs = {"k": v} if has_argument(func, "k") else {}; func(..., **kwargs) |
| Dataclass added/removed a field | if "field" in dataclass_fields(Cls): ... |
New abstract method on Platform / WorkerBase |
Implement a stub on the Spyre subclass. See manual_seed_all added in PR #952. |
| Method signature changed return type | Wrap call site to coerce: t = old_func(); return New(t) if New is not None else t |
Old hasattr guard now redundant |
Delete the guard, simplify the call site, and remove the matching compat test. |
When in doubt, check past upgrade PRs for precedent: gh pr list --search "vllm" --state merged --limit 30. PR #952 (0.20.0) is the richest example.
Phase 4 — Extend the CI matrix¶
In .github/workflows/test.yml, add a new entry under include: for the previous version (the old default). Mirror the format of the existing vLLM:0.20.x entries — only name and repo change. Keep vLLM:lowest, vLLM:main, and existing intermediates.
Phase 5 — Verify and audit¶
- Re-run
bash format.sh. (ruff, typos, ty, actionlint should all pass.) - Audit
git diff --stat uv.lock: - Tiny (~10 lines) → patch bump, only the vLLM rev moved. Expected.
- Huge (~3000+ lines) → vLLM minor release reshuffled transitives. Expected.
- Medium (50–500 lines) → look closely. Make sure unrelated packages didn't drift in surprising ways.
- Verify only expected files changed:
git status
# Expected: pyproject.toml, uv.lock, .github/workflows/test.yml,
# and (if compat shims added) sendnn_inference/**, tests/utils/test_upstream_compatibility.py
Phase 6 — Open the PR¶
Title: chore: bumps vllm to v{VERSION}
Body skeleton:
## Description
- Bumps vllm support to v{VERSION} (release notes: https://github.com/vllm-project/vllm/releases/tag/v{VERSION})
- <List each compat shim added/removed and why, or "no source changes required">
- <List override-dependency changes, if any>
## Test Plan
- CPU + compat suites pass locally
- Hardware tests pending `bot:test` (run after PR opens)
## Checklist
- [x] I have read the contributing guidelines
- [x] My code follows the project's code style (run `bash format.sh`)
- [ ] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)
- [x] My commits include a `Signed-off-by:` line (DCO compliance)
After the PR is open, request hardware validation by commenting:
For dependency-touching changes, comprehensive spyre-marked tests should be run before merge.
Scope guardrails¶
- Do not bump the lower bound during a vLLM upgrade. Lower-bound bumps are a separate task (see
maintaining.md→ "Removing support for a vLLM version") because they require deleting compat code and pruning the test matrix. - Do not "improve" code outside the compat surface. A vLLM upgrade PR should be reviewable as one focused change.
- If a transitive dep moves in a surprising way (
uv syncoutput shows e.g. transformers/torch shifting unexpectedly), surface it to the user before continuing — don't silently accept.