CI Failures
Overview of CI Checks
When you open a PR against vLLM-Omni, several CI checks run automatically:
| Check | Platform | What it does |
|---|---|---|
| pre-commit | GitHub Actions | Runs linting (Ruff), formatting, spell-checking (typos), and YAML validation. |
| Build Wheel | GitHub Actions | Builds Python wheels for Python 3.11 and 3.12 on Ubuntu. Skipped for docs-only or Markdown-only changes (controlled by paths-ignore in the workflow). |
| DCO | GitHub | Verifies every commit has a Signed-off-by line. |
| docs/readthedocs.org:vllm-omni | Read the Docs | Builds the MkDocs documentation site. |
| buildkite/vllm-omni | Buildkite | Runs GPU-based tests on NVIDIA CUDA hardware (L4, H100). |
| buildkite/vllm-omni-amd | Buildkite | Runs GPU-based tests on AMD ROCm hardware (MI325). |
| buildkite/vllm-omni-intel | Buildkite | Runs GPU-based tests on Intel XPU hardware (Intel Arc BMG). |
Step 1: Identify the Failing Check
Click the Details link next to the failing check on your PR to open the build log. The most common failures fall into these categories:
pre-commit failures
These are typically formatting or linting issues introduced by your PR. Fix them locally by running `pre-commit run --all-files`, then commit the fixes and push.
DCO failures
Every commit must include a Signed-off-by line. If you forgot, amend the most recent commit with `git commit --amend -s` and force-push the branch.
For multiple commits, use an interactive rebase to add the sign-off to each one.
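To see which commits still need a sign-off before pushing, a small script can scan the branch. The following is a minimal sketch, not the check that the DCO bot itself performs: it only looks for a line starting with `Signed-off-by:` in each commit message, and the function names and the `origin/main` base branch are assumptions made for illustration.

```python
import subprocess


def commits_missing_signoff(log_output: str) -> list[str]:
    """Return hashes of commits whose message has no Signed-off-by line.

    Expects the output of: git log --format=%H%x00%B%x01 <range>
    (hash, NUL byte, full message, then a \\x01 record separator).
    """
    missing = []
    for record in log_output.split("\x01"):
        record = record.strip()
        if not record:
            continue
        sha, _, message = record.partition("\x00")
        has_signoff = any(
            line.strip().startswith("Signed-off-by:")
            for line in message.splitlines()
        )
        if not has_signoff:
            missing.append(sha)
    return missing


def check_branch(base: str = "origin/main") -> list[str]:
    """Check commits on HEAD that are not on `base` (an assumed base branch)."""
    out = subprocess.run(
        ["git", "log", "--format=%H%x00%B%x01", f"{base}..HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return commits_missing_signoff(out)
```

Calling `check_branch()` from the repository root returns the hashes that would need a sign-off added during the rebase; an empty list suggests every commit on the branch already carries one.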
Read the Docs failures
The documentation build uses MkDocs with fail_on_warning: true, so even a minor warning (not just errors) will cause the build to fail. To reproduce locally, run `mkdocs build --strict`, which treats warnings as errors.
Common causes include broken cross-references, invalid admonition syntax, or missing files referenced by --8<-- includes.
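Missing include targets can be spotted before running the full docs build. The sketch below is an illustration under stated assumptions: it handles only the single-line quoted form (`--8<-- "path"`), ignores line-range or section suffixes, and resolves paths against a `base_dir` you supply, which may not match the snippet base path configured for this project. The function name is hypothetical.

```python
import re
from pathlib import Path

# Single-line snippet include, e.g.:  --8<-- "docs/snippets/example.md"
SNIPPET_RE = re.compile(
    r"""^[ \t]*--8<--[ \t]+["']([^"']+)["'][ \t]*$""", re.MULTILINE
)


def missing_snippet_files(markdown: str, base_dir: Path) -> list[str]:
    """Return include targets in `markdown` that do not exist under `base_dir`."""
    missing = []
    for match in SNIPPET_RE.finditer(markdown):
        target = match.group(1)
        if not (base_dir / target).is_file():
            missing.append(target)
    return missing
```

Running it over each Markdown file under the docs directory gives a quick list of include paths to double-check.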
Buildkite failures
Buildkite runs GPU tests in Docker containers. These are the most complex checks and can fail for reasons unrelated to your PR (infrastructure issues, flaky tests, etc.). See the sections below for how to investigate.
Step 2: Check if the Failure Is a Known Issue
Before spending time debugging, check whether the failure already exists on the main branch:
- Look at the Buildkite build log: the test name and error message are usually enough to identify the issue.
- Check recent CI runs on `main`: if the same test is failing there, the failure is not caused by your PR.
- Search existing issues: look for open issues in the vllm-omni issue tracker with the test name or error message.
If the failure is already tracked, leave a comment on your PR noting that the failure is pre-existing and link the issue.
Step 3: Investigate the Failure
If the failure appears to be new, investigate whether your changes caused it.
Reading Buildkite Logs
- Click Details next to the Buildkite check on your PR.
- Find the failing step in the pipeline (e.g., "Diffusion Model Test", "Simple Unit Test").
- Expand the step to see the full test output with the traceback.
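For long logs, it can help to pull out only the failing test names. The following sketch assumes the step runs pytest and that the downloaded log contains pytest's short summary lines (`FAILED <test-id> - <message>`); the function name is hypothetical, and logs with ANSI colour codes may need cleaning first.

```python
import re

# pytest short test summary lines look like:
#   FAILED tests/test_x.py::test_name - AssertionError: ...
FAILED_RE = re.compile(r"\bFAILED (\S+::\S+)(?: - (.*))?")


def failed_tests(log_text: str) -> list[tuple[str, str]]:
    """Return (test id, error message) pairs found in a pytest log."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            results.append((match.group(1), match.group(2) or ""))
    return results
```

The resulting test ids can be searched for in the issue tracker or passed to pytest when reproducing the failure locally.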
Running Tests Locally
For instructions on running tests locally (including specific test files, functions, and markers), see the Running Tests section in the Test Guide.
Step 4: Raise an Issue or Fix It
If the failure is pre-existing (not caused by your PR)
- Raise a new issue if one doesn't already exist, using the title format: `[CI Failure]: [job-name] - [test-path]`
- Include the error message, relevant log excerpts, and the commit hash where the failure occurs (e.g., "Still failing on main as of commit `abc1234`").
- Leave a comment on your PR linking to the issue and noting that the failure is unrelated to your changes.
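A small helper keeps issue reports consistent. This is a sketch only: it reads the square brackets around job-name and test-path in the title format as placeholders rather than literal characters, and the function name and body layout are assumptions, not a project template.

```python
def ci_failure_issue(
    job_name: str, test_path: str, error: str, commit: str
) -> tuple[str, str]:
    """Build an issue title and body for a pre-existing CI failure."""
    title = f"[CI Failure]: {job_name} - {test_path}"
    body = "\n".join(
        [
            "Error message:",
            error.strip(),
            "",
            f"Still failing on main as of commit {commit}.",
        ]
    )
    return title, body
```

For example, `ci_failure_issue("Diffusion Model Test", "tests/test_example.py", "AssertionError: boom", "abc1234")` produces the title `[CI Failure]: Diffusion Model Test - tests/test_example.py`, where the test path is a made-up placeholder.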
If the failure is caused by your PR
- Fix the issue in your branch and push the update.
- If the fix is non-trivial, consider adding a test to prevent regression.
Common Failure Patterns
| Symptom | Likely Cause | Fix |
|---|---|---|
| `ruff` or formatting errors | Code style violation | Run `pre-commit run --all-files` |
| `Signed-off-by` missing | DCO check | Amend commits with `git commit --amend -s` |
| MkDocs build warning | Broken docs reference | Run `mkdocs build --strict` locally |
| `OOM` or `CUDA out of memory` | Test exceeds GPU memory | Check if your changes increased memory usage; use `--vae_use_slicing` / `--vae_use_tiling` for diffusion tests |
| Import errors | Missing or changed dependency | Check `pyproject.toml` and make sure dependencies are correct |
| Timeout (step exceeded N minutes) | Test is too slow or hangs | Profile the test; check for infinite loops or deadlocks |
| `Agent lost` in Buildkite | Infrastructure issue (not your fault) | Re-trigger the build; comment on your PR |