
Multi-Level Automated Testing System Documentation

Document Overview

This testing system aims to build a complete, efficient, and well-structured quality assurance framework for the development, integration, and release of model services. It draws on the concept of the test pyramid from modern software engineering, progressively expanding testing activities from basic code logic verification to complex end-to-end (E2E) functionality, performance, accuracy, and even long-term stability validation.

Through five levels (L1-L5) and common (Common) specifications, the system clarifies the testing objectives, scope, execution frequency, and required resources for different development stages (e.g., each commit, PR merge, daily build, pre-release). This ensures that models meet high standards for functionality, performance, and reliability across various deployment scenarios (online serving and offline inference).

| Level | Scope & Focus | Time Cost | Test Dir | Doc | Frequency | Hardware |
| --- | --- | --- | --- | --- | --- | --- |
| Common | Contribution Guideline & PR checklist | / | / | .github/PULL_REQUEST_TEMPLATE.md<br>Test Style (PR Checklist) | / | / |
| Common | CI Failure Description | / | / | CI Failures | / | / |
| L1<br>(Unit & Logic) | Unit tests for components like entrypoints, models | <15min | /tests/{component_name}/test_xxx | Chapter 1<br>Section 1 L1&L2: Purpose, Test Content, Directory Location, Example | PR with ready label (also can run locally) | CPU |
| L2<br>(E2E across models & GPU-required UT) | Online & Offline (basic deployment scenarios):<br>dummy, normal inference function (output format, stream), some instance startup UT | | /tests/e2e/online_serving/test_{model_name}.py<br>/tests/e2e/offline_inference/test_{model_name}.py | Chapter 1<br>L1&L2: Purpose, Test Content, Directory Location, Example | PR with ready label | GPU |
| L3<br>(Important Perf & Integration & Accuracy) | Online & Offline (multiple deployment scenarios):<br>real model, normal inference function, normal accuracy | <30min | /tests/e2e/online_serving/test_{model_name}.py<br>/tests/e2e/offline_inference/test_{model_name}.py | Chapter 2<br>L3: Purpose, Test Content, Directory Location, Example | PR Merged (Also run L1&L2 Tests) | GPU |
| L4<br>(Perf & Integration & Accuracy) | Online & Offline: full functional scenarios + performance test + doc test | <3 hour | Full Function:<br>/tests/e2e/online_serving/test_{model_name}_expansion.py<br>/tests/e2e/offline_inference/test_{model_name}_expansion.py<br>Performance:<br>/tests/dfx/perf/tests/test_qwen_omni.json (Omni), test_tts.json (TTS),<br>and /tests/dfx/perf/tests/test_{diffusion_model}_vllm_omni.json (Diffusion)<br>Doc Test:<br>tests/example/online_serving/test_{model_name}.py<br>tests/example/offline_inference/test_{model_name}.py | Chapter 3<br>L4: Purpose, Test Content, Directory Location, Example | Nightly | GPU |
| L5<br>(Stability & Reliability) | Online & Offline: long-term stability test + reliability test | Depends on reality | Stability:<br>/tests/dfx/stability/tests/test_qwen3_omni.json<br>/tests/dfx/stability/tests/test_wan22.json<br>Reliability:<br>tests/dfx/reliability/test_reliability_{model_key}.py<br>(e.g. test_reliability_qwen3_omni.py, test_reliability_wan22.py) | Chapter 4<br>L5: Purpose, Test Content, Directory Location, Example | Weekly / Days before Release | GPU |

The folder structure for test files follows the five-level design.

Legend: `✅` = test exists, `⬜` = suggested to add.
vllm_omni/                                    tests/
├── config/                             →     ├── config/
│   ├── model.py                              │   ├── test_model.py                    ⬜
│   └── lora.py                               │   └── test_lora.py                      ⬜
├── core/                               →     ├── core/
│   └── sched/                                 │   └── sched/
│       ├── omni_ar_scheduler.py               │       ├── test_omni_ar_scheduler.py    ⬜
│       ├── omni_generation_scheduler.py       │       ├── test_omni_generation_scheduler.py  ⬜
│       └── output.py                          │       └── test_output.py               ✅ currently in entrypoints/test_omni_new_request_data.py (tests output.OmniNewRequestData)
├── diffusion/                          →     ├── diffusion/
│   ├── diffusion_engine.py                    │   ├── test_diffusion_engine.py          ⬜
│   ├── attention/                             │   ├── attention/
│   │   ├── layer.py                            │   │   ├── test_attention_sp.py         ✅
│   │   └── backends/                           │   │   └── test_flash_attn.py           ✅
│   ├── distributed/                           │   ├── distributed/
│   │   └── ...                                 │   │   ├── test_comm.py                 ✅
│   │                                            │   │   ├── test_cfg_parallel.py        ✅
│   │                                            │   │   └── test_sp_plan_hooks.py       ✅
│   ├── lora/                                   │   ├── lora/
│   │   └── ...                                 │   │   ├── test_base_linear.py          ✅
│   │                                            │   │   └── test_lora_manager.py        ✅
│   ├── models/                                 │   ├── models/
│   │   ├── qwen_image/                         │   │   ├── qwen_image/                 (e2e coverage)
│   │   ├── ovis_image/                         │   │   ├── ovis_image/
│   │   │   └── ...                             │   │   │   └── test_ovis_image.py     ✅
│   │   ├── z_image/                            │   │   └── z_image/
│   │   └── ...                                 │   │       └── test_zimage_tp_constraints.py  ✅
│   └── worker/                                 │   └── worker/
│       ├── diffusion_worker.py                 │       └── test_diffusion_worker.py   ✅ file at diffusion/test_diffusion_worker.py
│       └── diffusion_model_runner.py            │
├── distributed/                         →     ├── distributed/
│   └── omni_connectors/                         │   └── omni_connectors/
│       ├── adapter.py                           │       ├── test_adapter_and_flow.py   ✅
│       ├── kv_transfer_manager.py               │       ├── test_basic_connectors.py   ✅
│       ├── connectors/                           │       ├── test_kv_flow.py             ✅
│       └── utils/                               │       └── test_omni_connector_configs.py  ✅
├── engine/                             →     ├── engine/
│   ├── input_processor.py                      │   ├── test_input_processor.py         ⬜  (no processor.py in source)
│   ├── output_processor.py                     │   ├── test_output_processor.py         ⬜
│   └── arg_utils.py                            │   └── test_arg_utils.py               ⬜
├── entrypoints/                        →     ├── entrypoints/
│   ├── stage_utils.py                          │   ├── test_stage_utils.py            ✅
│   ├── cli/                                     │   ├── cli/                           (benchmarks/test_serve_cli.py covers CLI serve)
│   │   └── ...                                  │   │   └── test_*.py                  ⬜
│   └── openai/                                  │   └── openai_api/                    # maps to entrypoints/openai/
│       ├── api_server.py                        │       ├── test_api_server.py         ⬜  (e2e indirect coverage)
│       ├── serving_chat.py                       │       ├── test_serving_chat_sampling_params.py  ✅
│       ├── serving_speech.py                     │       ├── test_serving_speech.py     ✅
│       └── image_api_utils.py                   │       └── test_image_server.py      ✅
├── inputs/                             →     ├── inputs/
│   ├── data.py                                 │   ├── test_data.py                   ⬜
│   ├── parse.py                                │   ├── test_parse.py                 ⬜
│   └── preprocess.py                            │   └── test_preprocess.py            ✅ currently in entrypoints/test_omni_input_preprocessor.py
├── model_executor/                     →     ├── model_executor/
│   ├── layers/                                  │   ├── layers/
│   │   └── mrope.py                             │   │   └── test_mrope.py              ⬜
│   ├── model_loader/                            │   ├── model_loader/
│   │   └── weight_utils.py                      │   │   └── test_weight_utils.py      ⬜
│   ├── models/                                  │   ├── models/
│   │   ├── qwen2_5_omni/                         │   │   ├── qwen2_5_omni/
│   │   │   ├── qwen2_5_omni_thinker.py           │   │   │   ├── test_audio_length.py  ✅
│   │   │   ├── qwen2_5_omni_talker.py            │   │   │   ├── test_qwen2_5_omni_thinker.py  ⬜
│   │   │   └── qwen2_5_omni_token2wav.py         │   │   │   ├── test_qwen2_5_omni_talker.py  ⬜
│   │   └── qwen3_omni/                          │   │   │   └── test_qwen2_5_omni_token2wav.py  ⬜
│   │       └── ...                               │   │   └── qwen3_omni/
│   ├── stage_configs/                           │   │       └── test_*.py              ⬜
│   │   └── *.yaml                               │   └── stage_configs/                 (used by e2e, test_*.py can be added)  ⬜
│   └── stage_input_processors/                  │   └── stage_input_processors/
│       └── ...                                  │       └── test_*.py                 ⬜
├── sample/                             →     ├── sample/
│   └── __init__.py                             │   └── test_*.py                      ⬜
├── utils/                              →     ├── utils/
│   └── __init__.py                             │   └── test_*.py                       ⬜  (no platform_utils.py currently)
├── worker/                             →     ├── worker/
│   ├── gpu_ar_model_runner.py                  │   ├── test_gpu_ar_model_runner.py    ⬜
│   ├── gpu_ar_worker.py                        │   ├── test_gpu_ar_worker.py           ⬜
│   ├── gpu_generation_model_runner.py          │   ├── test_gpu_generation_model_runner.py  ✅
│   ├── gpu_generation_worker.py                │   ├── test_gpu_generation_worker.py  ⬜
│   ├── gpu_model_runner.py                     │   ├── test_omni_gpu_model_runner.py   ✅
│   └── mixins.py                               │   └── (npu under platforms/npu/worker/)  # not worker/npu/
├── platforms/                          →     (no tests/platforms/, e2e and stage_configs provide indirect coverage)
│   ├── cuda/
│   ├── npu/worker/                             # NPU worker here, not vllm_omni/worker/npu/
│   ├── rocm/
│   └── xpu/worker/
├── outputs.py                          →     test_outputs.py                         ✅ (at tests root)
├── (logger, patch, request, version)    →     (no corresponding unit test)
└── e2e (tests side only)               →     ├── e2e/
                                               ├── online_serving/                     ✅ non-empty
                                               │   ├── test_qwen2_5_omni.py
                                               │   ├── test_async_omni.py
                                               │   ├── test_qwen3_omni.py
                                               │   ├── test_qwen3_omni_expansion.py
                                               │   ├── test_mimo_audio.py
                                               │   └── test_images_generations_lora.py
                                               └── offline_inference/                  ✅
                                                   ├── test_qwen2_5_omni.py
                                                   ├── test_qwen3_omni.py
                                                   ├── test_bagel_text2img.py
                                                   ├── test_z_image.py
                                                   ├── test_wan22.py
                                                   ├── test_zimage_tensor_parallel.py
                                                   ├── test_cache_dit.py
                                                   ├── test_teacache.py
                                                   ├── test_stable_audio_expansion.py
                                                   ├── test_diffusion_cpu_offload.py
                                                   ├── test_diffusion_layerwise_offload.py
                                                   ├── test_diffusion_lora.py
                                                   ├── test_sequence_parallel.py
                                                   └── stage_configs/                  (legacy schema, still
                                                       ├── bagel_*.yaml                 present for unmigrated
                                                       └── npu/, rocm/, etc.            models)

# Migrated models (qwen3_omni_moe, qwen2_5_omni, qwen3_tts) live under
# vllm_omni/deploy/ instead — see docs/configuration/stage_configs.md.

Common Specifications

Before entering specific testing levels, the project establishes two common specifications aimed at standardizing the development process and quickly locating issues.

  1. PR Checklist (Test Style): This template defines the self-check items that must be completed before a Pull Request is submitted for code review. It ensures that each code change meets basic requirements such as code style, dependency updates, and documentation synchronization before entering the automated testing pipeline, serving as the first manual line of defense for quality assurance.
  2. CI Failure Explanation (CI Failures): This document archives and explains common failure patterns in the Continuous Integration (CI) pipeline, how to interpret error logs, and preliminary troubleshooting steps. It helps developers and testers quickly diagnose the causes of automated test failures.

Chapter 1: L1 & L2 Level Testing - Unit Testing and Basic End-to-End Verification

1.1 Testing Purpose

L1 and L2 level testing form the foundation of the quality assurance system. L1 level testing focuses on verifying the internal logic correctness of code units (e.g., functions, classes), ensuring each independent component behaves as designed.

L2 level testing builds upon L1 by introducing GPU resources and verifying that the model's end-to-end (E2E) process runs correctly in basic deployment scenarios. For example, it uses dummy models to confirm that core interfaces like the inference pipeline, output format, and streaming response work properly. The common goal of these two levels is to provide developers with rapid feedback, discovering and fixing issues early in the development cycle.

1.2 Testing Content and Scope

  • L1 (Unit & Logic Testing):
    • Scope: Tests internal functions and methods of core components such as entrypoints, models.
    • Focus: Branch coverage, exception handling, algorithm logic correctness. Does not involve external dependencies or the complete service stack.
    • Time Cost: Execution time is controlled within 15 minutes to ensure fast feedback.
  • L2 (Basic End-to-End Testing):
    • Scope: Covers two basic deployment scenarios: online (serving) and offline (inference).
    • Focus: Uses dummy models or lightweight real models to verify that the entire chain from request input to result output works normally, including output data structure, streaming (stream) support, etc. Also includes some unit tests that require launching independent service instances.
    • Characteristic: Requires GPU resources to perform model computations.

1.3 Test Directory and Execution Files

A clear directory structure is key to managing test cases efficiently.

  • L1 Test Directory: /tests/{component_name}/test_xxx.py
    • Here, {component_name} corresponds to modules in the source code, such as distributed, entrypoints, etc., and test_xxx.py is the specific test file.
  • L2 Test Directory:
    • Online Serving: /tests/e2e/online_serving/test_{model_name}.py
    • Offline Inference: /tests/e2e/offline_inference/test_{model_name}.py
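The L1 naming convention can be expressed as a small path-mapping helper. The sketch below is illustrative only: expected_l1_test_path is a hypothetical name, and it assumes the test tree mirrors the source tree one-to-one. The folder structure above shows exceptions (for example, entrypoints/openai/ maps to entrypoints/openai_api/), which this helper does not handle.

```python
from pathlib import PurePosixPath


def expected_l1_test_path(source_path: str) -> str:
    """Map a vllm_omni source file to the L1 test file the convention suggests.

    Hypothetical helper: assumes tests/ mirrors vllm_omni/ exactly.
    """
    parts = PurePosixPath(source_path).parts
    if len(parts) < 2 or parts[0] != "vllm_omni":
        raise ValueError(f"not a file under vllm_omni/: {source_path}")
    *dirs, filename = parts[1:]
    # tests/<same sub-directories>/test_<same file name>
    return str(PurePosixPath("tests", *dirs, f"test_{filename}"))


print(expected_l1_test_path("vllm_omni/config/model.py"))       # tests/config/test_model.py
print(expected_l1_test_path("vllm_omni/core/sched/output.py"))  # tests/core/sched/test_output.py
print(expected_l1_test_path("vllm_omni/outputs.py"))            # tests/test_outputs.py
```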

1.4 Execution Method and Example

  • Trigger Timing: PR with ready label. That is, when a developer adds a "ready for review" or similar label to a PR on platforms like GitHub, L1 and L2 tests are automatically triggered.
  • Execution Environment: L1 uses CPU environment; L2 requires GPU environment.
  • Script Example:
L1 Test Examples

Examples from `tests/model_executor/models/qwen2_5_omni/test_audio_length.py`:
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import pytest

pytestmark = [pytest.mark.core_model, pytest.mark.cpu]

def test_resolve_max_mel_frames_default():
    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import resolve_max_mel_frames

    assert resolve_max_mel_frames(None, default=30000) == 30000
    assert resolve_max_mel_frames(None, default=6000) == 6000


def test_resolve_max_mel_frames_explicit():
    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import resolve_max_mel_frames

    # Explicit argument always wins over default
    assert resolve_max_mel_frames(123, default=30000) == 123
    assert resolve_max_mel_frames(6000, default=30000) == 6000
    assert resolve_max_mel_frames(0, default=30000) == 0


@pytest.mark.parametrize("repeats", [2, 4])
@pytest.mark.parametrize("code_len", [0, 1, 32768])
@pytest.mark.parametrize("max_mel_frames", [None, -1, 0, 1, 6000, 30000])
def test_cap_and_align_mel_length_no_mismatch(repeats, code_len, max_mel_frames):
    """Guard that any max_mel_frames yields a mel length aligned to repeats, and
    consistent with the truncated code length (prevents concat mismatch).
    """
    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import cap_and_align_mel_length

    target_code_len, target_mel_len = cap_and_align_mel_length(
        code_len=code_len,
        repeats=repeats,
        max_mel_frames=max_mel_frames,
    )

    assert isinstance(target_code_len, int)
    assert isinstance(target_mel_len, int)

    if code_len == 0:
        assert target_code_len == 0
        assert target_mel_len == 0
        return

    assert target_code_len >= 1
    assert target_mel_len >= repeats
    assert target_mel_len % repeats == 0
    assert target_mel_len == target_code_len * repeats
    assert target_code_len <= code_len

    if max_mel_frames is not None and int(max_mel_frames) > 0 and int(max_mel_frames) >= repeats:
        assert target_mel_len <= int(max_mel_frames)
L2 Test Examples

Refer to the Test Examples in Chapter 2 for test cases that incorporate both L2 and L3 testing logic.

  • Run Command:

    pytest -s -v /tests/e2e/online_serving/test_{model_name}.py
    pytest -s -v -m 'core_model and cpu' --run-level=core_model

Chapter 2: L3 Level Testing - Core Integration, Performance, and Accuracy Verification

2.1 Testing Purpose

L3 level testing executes after code is merged into the main branch. Its core purpose is to verify the integration effect, key performance indicators, and output accuracy of real models in multiple deployment scenarios. It acts as the "quality gatekeeper" for the main branch, ensuring that no merge breaks the core capabilities of the model service. Testing needs to provide clear conclusions within a relatively short time (<30min), balancing test depth with feedback speed.

2.2 Testing Content and Scope

  • Deployment Scenarios: Covers richer online and offline deployment configurations, which may include different hardware configurations, batch sizes, concurrency levels, etc.
  • Core Verification:
    1. Inference Functionality: Ensures real models can perform forward computation normally and return results.
    2. Accuracy Compliance: Verifies that the model's evaluation metrics (e.g., accuracy) meet the expected baseline, preventing code changes from introducing accuracy issues.
    3. Important Performance: Verifies whether performance (e.g., P99 latency, throughput) in core scenarios meets preset thresholds.
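A threshold check of the kind described under Important Performance could look like the following. This is a minimal sketch under stated assumptions: the helper names, the nearest-rank percentile method, and the latency numbers are all illustrative and are not taken from the project's benchmark tooling.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_p99_threshold(latencies_ms: list[float], limit_ms: float) -> bool:
    """True when the P99 latency does not exceed the preset threshold."""
    return percentile_nearest_rank(latencies_ms, 99) <= limit_ms


# Made-up latencies: 1.0 ms, 2.0 ms, ..., 100.0 ms
latencies_ms = [float(v) for v in range(1, 101)]
print(percentile_nearest_rank(latencies_ms, 99))  # 99.0
print(meets_p99_threshold(latencies_ms, limit_ms=99.0))  # True
```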

2.3 Test Directory and Execution Files

  • Functional Testing:
    • Online Serving: /tests/e2e/online_serving/test_{model_name}.py
    • Offline Inference: /tests/e2e/offline_inference/test_{model_name}.py
    • (Note: L3 cases live in the same files as L2 cases and are selected with the advanced_model marker. The test_{model_name}_expansion.py files contain the more comprehensive scenario cases that run at L4.)

2.4 Execution Method and Example

  • Trigger Timing: PR Merged. Automatically triggered after code review is approved and merged into the main branch.
  • Execution Environment: GPU servers.
  • Script Example:
Test Examples

2.4.1 Mark Declaration Section

@pytest.mark.advanced_model
@pytest.mark.core_model
@pytest.mark.parametrize("omni_server", test_params, indirect=True)

Explanation:

@pytest.mark.advanced_model: Marks the test as L3 merge level, indicating deep validation with real models.

@pytest.mark.full_model: Marks L4 nightly-only suites (e.g. test_*_expansion.py, doc examples).

@pytest.mark.core_model: Marks the test as L1 or L2 level, indicating that this test case validates the basic functionality of the core model. It uses mock weights and only checks if the relevant interface functions correctly.

@pytest.mark.parametrize: A parameterization decorator that allows abstracting test data into parameters, enabling reuse of the same test logic across different data configurations. indirect=True indicates that parameters will be passed to the fixture for processing.

Notes: If you believe the test case only needs to execute basic run logic at the PR-level CI, you can mark it only with @pytest.mark.core_model. If you believe it only needs to execute deep validation at merge (L3), use @pytest.mark.advanced_model. For L4 nightly-only expansion and doc-example tests, use @pytest.mark.full_model with --run-level full_model. If the test case needs both basic run and deep validation, mark with @pytest.mark.core_model and the appropriate L3/L4 marker (advanced_model and/or full_model).

2.4.2 Test Function Definition and Documentation

def test_mix_to_text_audio_001(omni_server, openai_client) -> None:
    """
    Test multi-modal input processing and text/audio output generation via OpenAI API.
    Deploy Setting: default yaml
    Input Modal: text + audio + video + image
    Output Modal: text + audio
    Input Setting: stream=True
    Datasets: single request
    """

Explanation:

Function Naming Convention: Uses the test_ prefix, describes the test scenario mix_to_text_audio, and the number 001 indicates the first test case for this scenario.

Parameter Explanation:

omni_server: Omni server instance obtained via fixture, containing model information and configuration.

openai_client: Unified OpenAI client processing instance, encapsulating request sending and response validation logic.

Docstring: Describes the test purpose, deployment settings, input/output modalities, streaming settings, and dataset type in detail, providing clear context for test maintenance.

2.4.3 Multimodal Data Preparation

video_data_url = f"data:video/mp4;base64,{generate_synthetic_video(224, 224, 300)['base64']}"
image_data_url = f"data:image/jpeg;base64,{generate_synthetic_image(224, 224)['base64']}"
audio_data_url = f"data:audio/wav;base64,{generate_synthetic_audio(5, 1)['base64']}"

Explanation:

Data Generation Functions: Use the generate_synthetic_* series of functions to generate synthetic test data, avoiding reliance on external resources and ensuring test reproducibility and stability.

Parameter Explanation:

Video: width, height, duration_frames

Image: width, height

Audio: duration_seconds, channels
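To show what a synthetic-data helper of this shape might do, here is a stand-in built only on the Python standard library. It is not the project's generate_synthetic_audio: the name make_synthetic_wav, the 440 Hz tone, and the 16 kHz sample rate are assumptions chosen for illustration. Only the calling pattern (duration and channel count in, a dict with a 'base64' entry out) follows the snippet above.

```python
import base64
import io
import math
import struct
import wave


def make_synthetic_wav(duration_seconds: int, channels: int, sample_rate: int = 16000) -> dict:
    """Build a sine-tone WAV in memory and return it base64-encoded (hypothetical helper)."""
    frames = bytearray()
    for i in range(duration_seconds * sample_rate):
        # 440 Hz tone at half amplitude, 16-bit signed little-endian PCM
        sample = int(0.5 * 32767 * math.sin(2 * math.pi * 440 * i / sample_rate))
        frames += struct.pack("<h", sample) * channels  # same sample on every channel
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return {"base64": base64.b64encode(buffer.getvalue()).decode("ascii")}


audio_data_url = f"data:audio/wav;base64,{make_synthetic_wav(1, 1)['base64']}"
print(audio_data_url[:40])
```

Because the data is generated deterministically in memory, the test does not depend on files or network resources.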

2.4.4 Request Configuration and Keyword Validation

request_config = {
    "model": omni_server.model,
    "messages": messages,
    "stream": True,
    "key_words": {
        "audio": ["water", "cricket"],
        "video": ["sphere", "globe", "circle", "round"],
        "image": ["square", "quadrate"],
        "text": ["beijing"]
    },
}

Explanation:

Model Specification: Uses omni_server.model to ensure the test aligns with the model configured on the server.

Keyword Validation Mechanism: The key_words field addresses a need specific to multimodal testing: checking that the generated text reflects each modality. For every modality listed, validation succeeds if at least one of its keywords is present.

Audio Keywords: Expected elements in the generated text's description of the audio content (e.g., "water" for water sounds, "cricket" for cricket sounds).

Video Keywords: Expected elements in the generated text's description of the video content.

Image Keywords: Expected elements in the generated text's description of the image content.

Text Keywords: Expected elements in the generated text itself.
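The "at least one keyword per modality" rule can be sketched as follows. The function name missing_modalities and the case-insensitive substring matching are assumptions made for this illustration; the client's actual matching logic may differ.

```python
def missing_modalities(generated_text: str, key_words: dict[str, list[str]]) -> list[str]:
    """Return the modalities for which none of the expected keywords appear.

    Hypothetical helper: matches case-insensitively on substrings.
    """
    text = generated_text.lower()
    return [
        modality
        for modality, candidates in key_words.items()
        if not any(word.lower() in text for word in candidates)
    ]


key_words = {
    "audio": ["water", "cricket"],
    "video": ["sphere", "globe", "circle", "round"],
    "image": ["square", "quadrate"],
    "text": ["beijing"],
}

reply = "I hear running water, see a rotating sphere and a red square. The capital is Beijing."
print(missing_modalities(reply, key_words))  # []
print(missing_modalities("A square next to a globe.", key_words))  # ['audio', 'text']
```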

2.4.5 Request Execution

openai_client.send_omni_request(request_config, request_num=1)  # for omni-understanding models
# or
openai_client.send_diffusion_request(request_config, request_num=1)  # for diffusion models

Explanation:

Unified Client: Uses the OpenAIClientHandler instance to send requests. This client encapsulates error handling, retry mechanisms, and response validation logic.

Single Request: request_num=1 sends one request, matching the "Datasets: single request" entry in the docstring. For concurrent testing, set request_num=n to send multiple requests.

Implicit Validation: The send_omni_request and send_diffusion_request methods internally include validation logic that is selected dynamically based on the --run-level parameter: core_model performs basic validation, while advanced_model and full_model perform deep validation.
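The run-level dispatch described above might be structured like this. It is a sketch, not the client's code: what "basic" and "deep" validation actually check is an assumption here (non-empty output for basic, keyword presence for deep), and both function names are hypothetical.

```python
DEEP_LEVELS = {"advanced_model", "full_model"}
KNOWN_LEVELS = {"core_model"} | DEEP_LEVELS


def validation_depth(run_level: str) -> str:
    """Map a --run-level value to the validation depth named in the text."""
    if run_level not in KNOWN_LEVELS:
        raise ValueError(f"unknown run level: {run_level}")
    return "deep" if run_level in DEEP_LEVELS else "basic"


def validate_reply(text: str, run_level: str, expected_keywords: list[str]) -> None:
    """Hypothetical validator: basic checks always run, keyword checks only when deep."""
    # Assumed basic validation: the server returned some text.
    assert text.strip(), "empty response"
    if validation_depth(run_level) == "deep":
        # Assumed deep validation: at least one expected keyword appears.
        lowered = text.lower()
        assert any(k.lower() in lowered for k in expected_keywords), "no keyword matched"


validate_reply("placeholder output from dummy weights", "core_model", ["water"])
validate_reply("The sound of running water.", "advanced_model", ["water", "cricket"])
```

With this split, a test marked for both levels can share one body: a run with dummy weights only has to produce well-formed output, while a run with real weights must also produce meaningful content.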

  • Run Command (L3 merge): pytest -s -v /tests/e2e/online_serving/test_{model_name}.py -m advanced_model --run-level=advanced_model

  • Run Command (L4 nightly expansion): pytest -s -v /tests/e2e/online_serving/test_{model_name}_expansion.py -m full_model --run-level=full_model

Chapter 3: L4 Level Testing - Full Functionality, Performance, and Documentation Testing

3.1 Testing Purpose

L4 level testing is a comprehensive quality audit before a version release. It expands upon L3, executing full functional scenarios, conducting systematic performance stress tests, and simultaneously verifying the correctness of accompanying example documentation. Its purpose is to perform deep validation of the system during off-peak nighttime hours, providing quality trend reports for daytime development and data support for release decisions.

3.2 Testing Content and Scope

  • Full Functionality Testing: Executes all test cases defined in test_{model_name}_expansion.py, covering all implemented features, positive flows, boundary conditions, and exception handling.
  • Performance Testing: Uses tests/dfx/perf/tests/test_qwen_omni.json, tests/dfx/perf/tests/test_tts.json, and diffusion configs in the form tests/dfx/perf/tests/test_*_vllm_omni.json (passed to run_benchmark.py via --test-config-file) to drive performance testing tools for stress, load, and endurance tests, collecting metrics like throughput, response time, and resource utilization.
  • Documentation Testing: Verifies whether the example code provided to users is runnable and its results match the description.

3.3 Test Directory and Execution Files

  • Functional Testing: Same directories as L3.
  • Performance Test Configuration: tests/dfx/perf/tests/test_qwen_omni.json, tests/dfx/perf/tests/test_tts.json, and diffusion configs tests/dfx/perf/tests/test_*_vllm_omni.json (e.g. test_qwen_image_vllm_omni.json)
  • Documentation Example Tests:
    • tests/example/online_serving/test_{model_name}.py
    • tests/example/offline_inference/test_{model_name}.py

3.4 Execution Method and Example

  • Trigger Timing: Nightly, automatically executed every night.
  • Execution Environment: GPU server clusters to meet the resource demands of performance testing.
  • Script Example:
Test Examples: Documentation Example Tests

Preferred Test Strategy

Use one of the following patterns depending on page type:

  • Dynamic code-block extraction (preferred for offline docs)

    • Extract Python/Bash code blocks using a markdown AST analyzer, then execute them directly in tests.
    • Benefit: test logic stays automatically aligned with docs.
    • Basic idea: Use ReadmeSnippet.extract_readme_snippets to extract the list of code blocks into a module-level variable, use this list as the pytest.mark.parametrize parameters, and pass each snippet to example_runner.run inside the parametrized test. Also pass an output_subfolder argument for the second-level output folder described under Output Directory Structure below. If a test needs an extra environment variable (e.g., one the example script reads), example_runner.run also accepts a third env parameter.
    • See tests/examples/offline_inference/test_text_to_image.py for reference implementation.
  • Explicit copied scripts (used by online docs for now until further update)

    • For online serving pages, it is acceptable to copy code from docs into dedicated test functions, because only client-side, request-sending scripts are tested.
    • Rationale: dynamic extraction would be overly complex here, because it would need to tell server-launch scripts apart from client-request scripts.
    • Requirement: copied test code must be kept in sync with doc updates.

Test Case Naming Convention

  • Dynamic code extraction (auto-generated internally):
    • test_{single_function_name_matching_file_name}[h2_heading_00X]
    • Example: test_text_to_image[basic_usage_001]
  • Explicit copied scripts:
    • test_{h2_heading_00X}[{dummy_param_id_for_omni_server}]
    • Example: test_api_calls_001[omni_server0]
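A case identifier in the h2_heading_00X form could be derived from a heading as sketched below. The helper name case_id and the slug rules are assumptions, as is the reading of 00X as a 1-based counter.

```python
import re


def case_id(h2_heading: str, index: int) -> str:
    """Turn an H2 heading into an identifier such as basic_usage_001 (hypothetical helper)."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", h2_heading.lower()).strip("_")
    if not slug:
        raise ValueError("heading contains no alphanumeric characters")
    return f"{slug}_{index:03d}"


print(case_id("Basic Usage", 1))  # basic_usage_001
print(case_id("API Calls", 1))    # api_calls_001
```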

Runtime Configuration

In the example code tests, do not reduce num_inference_steps just to speed up the tests unless there is a strong CI reliability reason to do otherwise.

Skipping Rules

You may skip examples falling in the following categories using pytest.mark.skip or pytest.skip:

  • Gradio UI scripts
  • Scenarios that significantly overlap with existing tests and add little new coverage.

Output Directory Structure

Use a three-layer output structure to store output artifacts:

  1. Root output directory
    • Auto-detected from OUTPUT_DIR env var or auto-generated under /tmp.
  2. Doc-page directory
    • Define and use a clear page-level folder name in each test_*.py yourself (abbreviations are acceptable, e.g., example_offline_t2i).
  3. Test-case directory
    • Must match the case identifier (e.g., basic_usage_001).
    • Auto-generated for dynamic extracted tests.
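The three-layer layout can be assembled as in the sketch below. The function names and the temporary-directory prefix are hypothetical. The sketch relies on tempfile.mkdtemp, which uses the system temp location (typically /tmp on Linux), to stand in for "auto-generated under /tmp".

```python
import functools
import os
import tempfile
from pathlib import Path


@functools.lru_cache(maxsize=None)
def root_output_dir() -> Path:
    """Layer 1: OUTPUT_DIR if set, otherwise a fresh temp directory reused for the whole run."""
    configured = os.environ.get("OUTPUT_DIR")
    if configured:
        return Path(configured)
    return Path(tempfile.mkdtemp(prefix="example_outputs_"))


def case_output_dir(page_folder: str, case_identifier: str) -> Path:
    """Layers 2 and 3: <root>/<doc-page folder>/<test-case folder>."""
    out = root_output_dir() / page_folder / case_identifier
    out.mkdir(parents=True, exist_ok=True)
    return out


print(case_output_dir("example_offline_t2i", "basic_usage_001"))
```

Caching the root keeps all cases from one run under the same directory when OUTPUT_DIR is not set.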
Test Examples: Performance Tests

To add L4-level performance test cases, follow the format below in tests/dfx/perf/tests/test_qwen_omni.json, tests/dfx/perf/tests/test_tts.json, or a diffusion config such as tests/dfx/perf/tests/test_*_vllm_omni.json (selected via pytest ... run_benchmark.py --test-config-file <path>):

{
    "test_name": "test_qwen3_omni",
    "server_params": {
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "stage_config_name": "qwen3_omni.yaml"
    },
    "benchmark_params": [
        {
            "dataset_name": "random",
            "num_prompts": [10, 20],
            "max_concurrency": [1, 4],
            "random_input_len": 2500,
            "random_output_len": 900,
            "ignore_eos": true,
            "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration",
            "baseline": {
                "mean_ttft_ms": [500, 800],
                "mean_audio_ttfp_ms": [2000, 3500],
                "mean_audio_rtf": [0.25, 0.35]
            }
        }
    ]
}
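The array-valued fields in this example are meant to line up with each other. As a sketch of the num_prompts rule given in the benchmark_params table (a single value is repeated, an array must match the length of max_concurrency), the helper below expands one benchmark_params entry into individual runs. Pairing the arrays element by element is an assumption, expand_runs is a hypothetical name, and the second entry in the sample config is invented to show the single-value case.

```python
import json

CONFIG_TEXT = """
{
    "test_name": "test_qwen3_omni",
    "benchmark_params": [
        {"dataset_name": "random", "num_prompts": [10, 20], "max_concurrency": [1, 4]},
        {"dataset_name": "random", "num_prompts": 10, "max_concurrency": [1, 2, 4]}
    ]
}
"""


def expand_runs(params: dict) -> list[dict]:
    """Expand one benchmark_params entry into one dict per concurrency level."""
    levels = params["max_concurrency"]
    levels = [levels] if isinstance(levels, int) else list(levels)
    prompts = params["num_prompts"]
    if isinstance(prompts, int):
        prompts = [prompts] * len(levels)  # single value: repeat for every level
    if len(prompts) != len(levels):
        raise ValueError("num_prompts and max_concurrency must have the same length")
    return [{"num_prompts": n, "max_concurrency": c} for n, c in zip(prompts, levels)]


config = json.loads(CONFIG_TEXT)
for params in config["benchmark_params"]:
    print(expand_runs(params))
```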

Parameter Explanation

Overview

| Field | Required | Description |
| --- | --- | --- |
| test_name | Yes | Unique identifier for the test case |
| server_params | Yes | Server-side configuration parameters |
| benchmark_params | Yes | Benchmark running parameters (supports multiple configurations) |

server_params Configuration

Basic Parameters

| Parameter | Required | Example | Description |
| --- | --- | --- | --- |
| model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path |
| stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name |

Dynamic Configuration (update/delete)

Supports incremental modifications based on the basic configuration:

| Operation | Description |
| --- | --- |
| update | Update or add configuration items |
| delete | Delete specified configuration items |

Example:

"update": {
    "async_chunk": true,  // Enable asynchronous chunk processing
    "stage_args": {
        "0": {
            "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk"
        }
    }
},
"delete": {
    "stage_args": {
        "2": ["custom_process_input_func"]  // Delete this configuration for stage 2
    }
}
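
To make the update/delete semantics concrete, here is a minimal sketch of how such overrides could be applied to a base configuration. It is an illustration under stated assumptions, not the project's loader: the function names are hypothetical, the base configuration is assumed to be a nested dict with stage_args keyed by stage id strings, and dotted keys such as engine_args.xxx are assumed to address nested fields.

```python
import copy
from typing import Any, Dict, Optional


def _set_dotted(target: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set target["a"]["b"] = value for dotted_key "a.b" (assumed convention)."""
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        target = target.setdefault(name, {})
    target[leaf] = value


def apply_update(config: Dict[str, Any], update: Dict[str, Any]) -> None:
    """Recursively merge `update` into `config` in place."""
    for key, value in update.items():
        if isinstance(value, dict):
            # Descend into (or create) the nested section.
            apply_update(config.setdefault(key, {}), value)
        else:
            _set_dotted(config, key, value)


def apply_delete(config: Dict[str, Any], delete: Dict[str, Any]) -> None:
    """Remove keys listed in `delete`; leaves are lists of key names."""
    for key, value in delete.items():
        section = config.get(key, {})
        if isinstance(value, dict):
            apply_delete(section, value)
        else:
            for name in value:
                section.pop(name, None)  # Missing keys are ignored.


def apply_overrides(
    base: Dict[str, Any],
    update: Optional[Dict[str, Any]] = None,
    delete: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a modified copy of `base`; the input is left untouched."""
    config = copy.deepcopy(base)
    apply_update(config, update or {})
    apply_delete(config, delete or {})
    return config
```

Applying the update first and the delete second is a design choice of this sketch; the real tool may order or resolve them differently.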

benchmark_params Configuration

You can add any benchmark running parameters you need here. For all optional parameters, refer to the benchmark documentation. General modifications are as follows:

  1. Convert each --xxx-xx-xx command-line parameter to the form xxx_xx_xx and use it as a key in the JSON file.
  2. Express boolean command-line flags as JSON booleans, e.g., ignore_eos: true or ignore_eos: false.
  3. Optionally add a baseline object (see Baseline thresholds below). If you omit baseline or leave it empty, the performance test still runs but does not assert metric thresholds from this field.
  4. The qps and concurrency modes are recommended to be mutually exclusive. For detailed explanations, see the table below:

| Parameter | Type | Required | Example/Values | Description |
| --- | --- | --- | --- | --- |
| num_prompts | int / array | Yes | 10, [10, 20, 30] | Number of requests. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of qps or max_concurrency, e.g., [10,10,10]. If an array is used, its length must match the number of qps or max_concurrency. |
| request_rate | float / array | No | 0.5, [0.5, 1, inf] | Queries per second. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of num_prompts, e.g., [1,1,1]. If an array is used, its length must match the number of num_prompts. |
| max_concurrency | int / array | No | 1, [1, 2, 3] | Maximum concurrent in-flight requests. Same array / expansion rules as request_rate (mutually exclusive with QPS mode). |
| baseline | object | No | see above | Optional per-metric thresholds; keys must match benchmark output fields. Scalar, list (per sweep step), or object (keyed by concurrency or QPS string). |
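
The scalar/array expansion rules in the table can be sketched as follows. This is a minimal illustration of the described behavior, not the benchmark runner's code: the function name expand_sweep and the shape of its return value are assumptions, and treating QPS and concurrency as strictly exclusive is stricter than the "recommended" wording above.

```python
from typing import Any, Dict, List, Optional, Sequence, Union

Number = Union[int, float]
Sweep = Union[Number, Sequence[Number]]


def _is_array(value: Any) -> bool:
    return isinstance(value, (list, tuple))


def expand_sweep(
    num_prompts: Sweep,
    request_rate: Optional[Sweep] = None,
    max_concurrency: Optional[Sweep] = None,
) -> List[Dict[str, Number]]:
    """Expand one benchmark_params entry into per-step settings."""
    if request_rate is not None and max_concurrency is not None:
        # This sketch enforces the recommended exclusivity as an error.
        raise ValueError("use either request_rate or max_concurrency, not both")

    if request_rate is not None:
        key, load = "request_rate", request_rate
    elif max_concurrency is not None:
        key, load = "max_concurrency", max_concurrency
    else:
        # No load parameter: one step per num_prompts value.
        values = list(num_prompts) if _is_array(num_prompts) else [num_prompts]
        return [{"num_prompts": n} for n in values]

    prompts = list(num_prompts) if _is_array(num_prompts) else None
    loads = list(load) if _is_array(load) else None

    if prompts is not None and loads is not None and len(prompts) != len(loads):
        raise ValueError(f"num_prompts and {key} arrays must have the same length")

    steps = len(prompts) if prompts is not None else len(loads) if loads is not None else 1
    # A single value is repeated to match the other parameter's length.
    prompts = prompts if prompts is not None else [num_prompts] * steps
    loads = loads if loads is not None else [load] * steps
    return [{"num_prompts": n, key: v} for n, v in zip(prompts, loads)]
```

With the configuration from the performance example, expand_sweep([10, 20], max_concurrency=[1, 4]) yields two steps: 10 prompts at concurrency 1 and 20 prompts at concurrency 4.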
Test Examples: Functionality Tests

Scope

For diffusion models, the L4 functionality test covers all, or at least the common, diffusion features that the model supports, including several parallelism acceleration methods, CPU offloading, the TeaCache and Cache-DiT cache backends, and quantization methods.

Test Case Design

For a high priority model (currently listed in issue #1832), we use several test cases, each with multiple features turned on, so that each supported feature is tested in at least one test case. This is to relieve the GPU workload on the CI machine. The suggested test case combination is as follows:

  • If the model fits on 4 L4 GPUs (20GB RAM each) with quantization and tensor parallelism always on
    • (1 GPU) TeaCache + Layerwise CPU offloading + GGUF
    • (4 GPUs) CacheDiT + Ulysses=2 + TP=2 + VAE=2 + FP8
    • (4 GPUs) CacheDiT + Ring=2 + HSDP=2 + VAE=2 + GGUF
    • (4 GPUs) TeaCache + CFG=2 + TP=2 + VAE=2 + FP8
  • Otherwise, consider a 2 H100 GPU environment (80GB RAM each) with the following tests
    • (1 GPU) TeaCache + Layerwise CPU offloading + GGUF
    • (2 GPUs) CacheDiT + Ulysses=2 + FP8
    • (2 GPUs) CacheDiT + Ring=2 + GGUF
    • (2 GPUs) TeaCache + CFG=2 + FP8
    • (2 GPUs) CacheDiT + TP=2 + VAE=2 + FP8
    • (2 GPUs) CacheDiT + HSDP=2 + VAE=2 + GGUF
  • If 2 H100 GPUs cannot handle the model either (e.g., HunyuanImage 3.0)
    • Still design tests and feature combinations that best fit real-world scenarios.
    • Do not include it in CI (or exclude it from the CI's filtering criteria). Instead, relevant PR authors are suggested to run these tests locally.
  • Fallback plan
    • If the model does not support layerwise CPU offloading, replace the corresponding test case with module-wise offloading.
    • If the model supports only specific caching features, use those in place of the suggested ones; if it supports none, remove caching from all test cases.
    • If the model supports only specific quantization methods, use those in place of the suggested ones; if it supports none, remove quantization from all test cases.
    • If the model does not support some other feature, remove that feature from the affected test case. If the modified test case's coverage then completely overlaps with that of other test cases, remove it.
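
The two design rules above (every supported feature appears in at least one test case, and a case whose coverage is fully duplicated is dropped) can be checked mechanically. The following is a minimal sketch with hypothetical helper names and feature labels; it reads "completely overlaps" as "every feature of the case is already covered by the remaining cases", which is one possible interpretation.

```python
from typing import Dict, Set


def uncovered_features(supported: Set[str], test_cases: Dict[str, Set[str]]) -> Set[str]:
    """Return supported features that no test case exercises."""
    covered: Set[str] = set()
    for features in test_cases.values():
        covered |= features
    return supported - covered


def prune_redundant(test_cases: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop cases whose features are all covered by the cases still kept.

    Cases are examined one at a time so that two identical cases do not
    eliminate each other.
    """
    kept = dict(test_cases)
    for name in list(kept):
        others: Set[str] = set()
        for other_name, features in kept.items():
            if other_name != name:
                others |= features
        if kept[name] <= others:
            del kept[name]
    return kept


# Hypothetical plan for a model that lacks Ring attention support:
# the former "CacheDiT + Ring=2 + GGUF" case shrinks to CacheDiT + GGUF.
cases = {
    "offload": {"TeaCache", "LayerwiseOffload", "GGUF"},
    "ulysses": {"CacheDiT", "Ulysses", "FP8"},
    "ring": {"CacheDiT", "GGUF"},
}
remaining = prune_redundant(cases)
```

Here the shrunken "ring" case adds no new coverage, so only "offload" and "ulysses" remain.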

For a normal priority model, further reduce the number of test cases.

  • Only write one or two test cases for the most common feature combinations.
  • Authors can experiment to find which feature combination best balances output quality and performance. Alternatively, they can refer to the example code in the PR that added the model, or in a PR that added a feature (if that code involves the model of interest).

Currently, all features are available in online serving mode. Hence, you only need to add tests/e2e/online_serving/test_{model}_expansion.py.

Code Style

  • Validation: test that the multimodal output files of your model have the correct shapes. OpenAIClientHandler.send_diffusion_request should already take care of this.
  • Test marks: always add full_model and diffusion for L4 nightly test_*_expansion.py cases. Add GPU-related marks if needed. Ref: Markers for Tests.
  • To maximize code reuse, you may refer to
    • tests/conftest.py for the omni_server (runs the server in a subprocess) and openai_client (sends requests and validates output) fixtures, and for the generate_synthetic_image and assert_XXX_valid helpers.
    • tests/helpers/mark.py for @hardware_test(...) and hardware_marks.
    • Parametrizing tests (pytest doc) to reuse test function implementation for different cases.
  • Doc: add a concise docstring for each test function.
  • Reference L4 test implementation: tests/e2e/online_serving/test_qwen_image_edit_expansion.py.
  • Run Command: (Specific commands would depend on the performance testing tool and configuration defined in nightly.json).

Chapter 4: L5 Level Testing - Stability and Reliability Testing

4.1 Testing Purpose

L5 level testing focuses on the performance of model services under long-running and abnormal fault scenarios. It aims to uncover deep-seated issues that only manifest under sustained pressure or extreme conditions, such as memory leaks, resource contention, gradual performance degradation, and lack of fault tolerance mechanisms. This is the final, yet crucial, line of defense for ensuring service high availability and production environment robustness.

4.2 Testing Content and Scope

  • Long-term Stability (Stability) Testing: Uses JSON under tests/dfx/stability/tests/ (for example test_qwen3_omni.json and test_wan22.json) to run the service under moderate load for an extended period (e.g., over 12 hours), monitoring whether metrics like memory/VRAM usage, response time, and throughput degrade over time, and whether the service process remains stable.
  • Reliability Testing: Uses pytest suites under tests/dfx/reliability/ to inject controlled faults against a live vllm_omni serve instance (same omni_server / omni_server_function fixture style as E2E). Current suites cover: GPU memory pressure (a CUDA sidecar “memory hog”); worker / runtime process kill (SIGKILL on VLLM::Worker for Qwen3-Omni, or on multiprocessing.spawn for Wan2.2 video workers); large multimodal chat or /v1/videos jobs under OOM; /health → 503 and fast-fail, non-hanging concurrent requests after a kill; and OpenAI-style 5xx error contracts (e.g., text vs. text+audio under OOM). Post-fault recovery checks exist where enabled (some cases may be skipped while issues are tracked). See the reliability test suite description in Section 4.4 for file-level responsibilities and CI markers (slow, hardware_test, POSIX-only kill).
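
As an illustration of the stability criterion above (whether a metric such as memory usage or response time degrades over a long run), one simple check compares the start of the run with its end. This is a hedged sketch only: the function name, the window size, and any threshold applied to the result are assumptions, not the method used by the stability suite.

```python
from statistics import mean
from typing import Sequence


def relative_drift(samples: Sequence[float], window: float = 0.1) -> float:
    """Relative change between the first and last `window` fraction of samples.

    Assumes samples are evenly spaced in time and strictly positive
    (e.g., memory in MiB or latency in ms).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = max(1, int(len(samples) * window))
    start = mean(samples[:n])
    end = mean(samples[-n:])
    return (end - start) / start


# Hypothetical VRAM readings (MiB) sampled over a run: a steady climb
# like this one would be worth investigating as a possible leak.
vram = [40000 + 50 * i for i in range(100)]
drift = relative_drift(vram)
```

A caller could flag the run when drift exceeds a tolerance chosen for that metric.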

4.3 Test Directory and Execution Files

  • Stability Test Configuration: tests/dfx/stability/tests/test_qwen3_omni.json, tests/dfx/stability/tests/test_wan22.json (one JSON per model / runner family)
  • Reliability Test Suite (tests/dfx/reliability/):
    • test_reliability_qwen3_omni.py: Qwen3-Omni chat / multimodal reliability (GPU OOM, process kill, recovery, error contract under --async-chunk vs default).
    • test_reliability_wan22.py: Wan2.2 T2V video API reliability (/v1/videos under OOM and process kill, recovery).
    • helpers.py: Shared primitives used by current suites: raw HTTP probes for /v1/chat/completions and /health, OpenAI-style error parsing, GPU OOM sidecar (inject_gpu_oom / stop_gpu_oom_hogs), and pgrep-based process-kill injector construction (make_process_kill_fault_injector).
    • conftest.py: fault_injector and omni_server_after_fault / omni_server_after_fault_function fixtures to run a callable after the server is ready.
    • README.md: Short local run commands for this directory.

4.4 Execution Method and Example

  • Trigger Timing: Weekly, or several days before a major release. Because execution times are long, these tests run less frequently than lower-level tests.
  • Execution Environment: GPU servers, requiring a stable and exclusive testing environment.
  • Script Example:
Test Examples

When you want to add L5-level stability test cases, add or extend the appropriate JSON file under `tests/dfx/stability/tests/` (for example `test_qwen3_omni.json` for Omni bench traffic, or `test_wan22.json` for diffusion `/v1/videos` workloads). The following illustrates the Qwen3-Omni shape:
{
    "test_name": "test_qwen3_omni_stability",
    "server_params": {
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "stage_config_name": "qwen3_omni.yaml"
    },
    "benchmark_params": [
        {
            "dataset_name": "random",
            "backend": "openai-chat-omni",
            "endpoint": "/v1/chat/completions",
            "duration_sec": 43200,
            "request_rate": 0.5,
            "num_prompts_per_batch": 20,
            "random_input_len": 2500,
            "random_output_len": 900,
            "ignore_eos": true,
            "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration"
        }
    ]
}
#### Parameter Explanation

***Overview***

| Field | Required | Description |
| ---------------- | -------- | --------------------------------------------------------------------------- |
| test_name | Yes | Unique identifier for the stability test case |
| server_params | Yes | Server-side configuration parameters (model, stage configuration, etc.) |
| benchmark_params | Yes | Stability benchmark running parameters (supports multiple configurations) |

#### server_params Configuration

##### Basic Parameters

| Parameter | Required | Example | Description |
| ----------------- | -------- | ---------------------------------- | ----------------------------------- |
| model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path |
| stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name |

##### Dynamic Configuration (update/delete)

Supports incremental modifications based on the basic configuration:

| Operation | Description |
| --------- | ------------------------------------ |
| update | Update or add configuration items |
| delete | Delete specified configuration items |

***Example***: You can refer to Test Examples in Chapter 3.4

#### benchmark_params Configuration

For stability testing, the key parameters are:

- **duration_sec**: Total duration (in seconds) during which benchmark traffic is sent. The stability benchmark will keep sending batches until this duration is reached.
- **request_rate** / **max_concurrency**: Exactly one of them must be specified. They control how the traffic is generated for each batch:
  - `request_rate`: Number of requests per second. The benchmark will send `num_prompts_per_batch` requests at the given rate.
  - `max_concurrency`: Maximum number of concurrent requests. When this is used, `request_rate` is set to `inf` internally.
- **num_prompts_per_batch**: Number of prompts sent in each batch. Multiple batches will be executed sequentially within `duration_sec`.

All other optional parameters follow the same rules as in Chapter 3.4.
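
The stability parameter rules above can be expressed as a small validation step. This is a minimal sketch, not the stability runner's code: the function name is hypothetical, and requiring duration_sec and num_prompts_per_batch to be positive integers is an assumption. The "exactly one of" rule and the internal inf substitution come from the description above.

```python
from typing import Any, Dict


def validate_stability_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Check one stability benchmark_params entry and return a resolved copy."""
    has_rate = "request_rate" in params
    has_concurrency = "max_concurrency" in params
    if has_rate == has_concurrency:
        raise ValueError("specify exactly one of request_rate or max_concurrency")

    # Assumed sanity checks on the two remaining key parameters.
    for key in ("duration_sec", "num_prompts_per_batch"):
        value = params.get(key)
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer")

    resolved = dict(params)
    if has_concurrency:
        # Concurrency mode: requests are not rate limited.
        resolved["request_rate"] = float("inf")
    return resolved
```

For the Qwen3-Omni example (duration_sec 43200, request_rate 0.5, num_prompts_per_batch 20), the entry passes unchanged; replacing request_rate with max_concurrency would add request_rate = inf to the resolved copy.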
Reliability test suite (tests/dfx/reliability)

#### Purpose and relationship to stability

Reliability tests are **short fault-injection** integration runs (L5 **(b)** in `tests/dfx/reliability/README.md`). They complement **stability** JSON-driven long runs: instead of hours of steady traffic, they **perturb** the server (GPU OOM sidecar, fatal signals on selected processes) and check **failure mode** and **latency bounds** (e.g. chat or `/v1/videos` must not hang under concurrent fault-time load).

#### Directory layout

| Path | Responsibility |
| ---- | -------------- |
| `helpers.py` | Shared helpers used by current reliability suites: raw `POST`/`GET` probes (`/v1/chat/completions`, `/health`), OpenAI error parsing (`extract_openai_error_contract_from_bytes`), GPU OOM sidecar lifecycle (`inject_gpu_oom`, `stop_gpu_oom_hogs`), and process-kill injector builder (`make_process_kill_fault_injector`). |
| `conftest.py` | Pytest fixtures: indirect `fault_injector`, `omni_server_after_fault` / `omni_server_after_fault_function` (run injector after server is ready, then yield server). |
| `test_reliability_qwen3_omni.py` | Qwen3-Omni: OOM vs **text vs text+audio** error contract, large multimodal chat under OOM, concurrent pressure, **SIGKILL** on `VLLM::Worker`, `/health` → 503 + fast-fail + concurrent chat; optional OOM recovery scenario (may be skipped while tracked in issues). |
| `test_reliability_wan22.py` | Wan2.2 T2V: large `/v1/videos` under OOM, **SIGKILL** on `multiprocessing.spawn` chain, health / fast-fail / concurrent video requests; optional recovery test (may be skipped). |
| `README.md` | Minimal run / collect examples. |

#### Parametrization and markers

- Each test module defines a **`RELIABILITY_SCENARIOS`** list (`test_name`, `server_params`: model, `stage_config_name` or diffusion `server_args`, etc.). **`create_reliability_omni_server_params()`** in `tests/dfx/conftest.py` resolves stage paths (including XPU substitutions where applicable) and builds **`OmniServerParams`** lists consumed by **`@pytest.mark.parametrize(..., indirect=True)`** on `omni_server` or `omni_server_function`.
- Cases are tagged **`@pytest.mark.slow`** for weekly / selective CI. GPU-heavy suites use **`@hardware_test(res={"cuda": "H100"}, num_cards=...)`** (Qwen3-Omni paths often require **2** cards; Wan2.2 video paths **1** card).
- **Process-kill** tests use **`@pytest.mark.skipif(os.name == "nt", ...)`** because injection uses POSIX **`pgrep` / `kill`**.

#### CI trigger

Weekly Buildkite (`.buildkite/test-weekly.yml`) runs, for example:
pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m "slow"
pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m "slow"
#### Local commands
pytest --collect-only tests/dfx/reliability
pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m slow
pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m slow
#### Adding a new model suite

1. Add `test_reliability_.py` under `tests/dfx/reliability/`.
2. Define **`RELIABILITY_SCENARIOS`** and pass them through **`create_reliability_omni_server_params()`** with the correct deploy or e2e stage-config directory (same pattern as existing files).
3. Reuse **`helpers`** for OOM / kill / raw HTTP; prefer **`assert_fault_exception()`** and **`resolve_oom_device_spec()`** from `tests/dfx/conftest.py` for consistent device selection vs stage YAML.
4. Register **`slow`** (and **`hardware_test`** if needed); extend **`.buildkite/test-weekly.yml`** when the suite should run in weekly L5.

    • Stability: pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_omni.py or pytest -s -v tests/dfx/stability/scripts/test_stability_wan22.py (or add test_stability_<model>.py alongside a matching JSON config)
    • Reliability: pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m slow and/or pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m slow (add test_reliability_<suite>.py for new models)

Summary

This multi-level testing system achieves continuous, progressive validation of model service quality by tightly integrating testing activities with the development workflow (commit, review, merge, release). From rapid unit testing to comprehensive end-to-end testing, and further to in-depth performance, stability, and reliability verification, each level has clear objectives, collectively building a robust quality protection net. By following this system, teams can deliver high-quality, highly reliable model services more efficiently.