Multi-Level Automated Testing System Documentation
Document Overview
This testing system aims to build a complete, efficient, and well-structured quality assurance framework for the development, integration, and release of model services. It draws on the concept of the test pyramid from modern software engineering, progressively expanding testing activities from basic code logic verification to complex end-to-end (E2E) functionality, performance, accuracy, and even long-term stability validation.
Through five levels (L1-L5) and common (Common) specifications, the system clarifies the testing objectives, scope, execution frequency, and required resources for different development stages (e.g., each commit, PR merge, daily build, pre-release). This ensures that models meet high standards for functionality, performance, and reliability across various deployment scenarios (online serving and offline inference).
| Level | Scope & Focus | Time Cost | Test Dir | Doc | Frequency | Hardware |
|---|---|---|---|---|---|---|
| Common | Contribution Guideline & PR checklist | / | / | .github/PULL_REQUEST_TEMPLATE.md Test Style (PR Checklist) | / | / |
| | CI Failure Description | / | / | CI Failures | / | / |
| L1 (Unit & Logic) | Unit tests for components like entrypoints, models | <15min | /tests/{component_name}/test_xxx | Chapter 1 Section 1 L1&L2: Purpose, Test Content, Directory Location, Example | PR with ready label (also can run locally) | CPU |
| L2 (E2E across models & GPU-required UT) | Online & Offline (basic deployment scenarios): dummy, normal inference function (output format, stream), some instance startup UT | <15min | /tests/e2e/online_serving/test_{model_name}.py /tests/e2e/offline_inference/test_{model_name}.py | Chapter 1 L1&L2: Purpose, Test Content, Directory Location, Example | PR with ready label | GPU |
| L3 (Important Perf & Integration & Accuracy) | Online & Offline (multiple deployment scenarios): real model, normal inference function, normal accuracy | <30min | /tests/e2e/online_serving/test_{model_name}.py /tests/e2e/offline_inference/test_{model_name}.py | Chapter 2 L3: Purpose, Test Content, Directory Location, Example | PR Merged (Also run L1&L2 Tests) | GPU |
| L4 (Perf & Integration & Accuracy) | Online & Offline: full functional scenarios + performance test + doc test | <3 hour | Full Function: /tests/e2e/online_serving/test_{model_name}_expansion.py /tests/e2e/offline_inference/test_{model_name}_expansion.py Performance: /tests/dfx/perf/tests/test_qwen_omni.json (Omni), test_tts.json (TTS), and /tests/dfx/perf/tests/test_{diffusion_model}_vllm_omni.json (Diffusion) Doc Test: tests/example/online_serving/test_{model_name}.py tests/example/offline_inference/test_{model_name}.py | Chapter 3 L4: Purpose, Test Content, Directory Location, Example | Nightly | GPU |
| L5 (Stability & Reliability) | Online & Offline: long-term stability test + reliability test | Depends on reality | Stability: /tests/dfx/stability/tests/test_qwen3_omni.json /tests/dfx/stability/tests/test_wan22.json Reliability: tests/dfx/reliability/test_reliability_{model_key}.py (e.g. test_reliability_qwen3_omni.py, test_reliability_wan22.py) | Chapter 4 L5: Purpose, Test Content, Directory Location, Example | Weekly / Days before Release | GPU |
The folder structure for test files, based on the five-level design:
Legend: `✅` = test exists, `⬜` = suggested to add.
vllm_omni/ tests/
├── config/ → ├── config/
│ ├── model.py │ └── test_model.py ⬜
│ └── lora.py │ └── test_lora.py ⬜
│
├── core/ → ├── core/
│ └── sched/ │ └── sched/
│ ├── omni_ar_scheduler.py │ ├── test_omni_ar_scheduler.py ⬜
│ ├── omni_generation_scheduler.py │ ├── test_omni_generation_scheduler.py ⬜
│ └── output.py │ └── test_output.py ✅ currently in entrypoints/test_omni_new_request_data.py (tests output.OmniNewRequestData)
│
├── diffusion/ → ├── diffusion/
│ ├── diffusion_engine.py │ ├── test_diffusion_engine.py ⬜
│ ├── attention/ │ ├── attention/
│ │ ├── layer.py │ │ ├── test_attention_sp.py ✅
│ │ └── backends/ │ │ └── test_flash_attn.py ✅
│ ├── distributed/ │ ├── distributed/
│ │ └── ... │ │ ├── test_comm.py ✅
│ │ │ │ ├── test_cfg_parallel.py ✅
│ │ │ │ └── test_sp_plan_hooks.py ✅
│ ├── lora/ │ ├── lora/
│ │ └── ... │ │ ├── test_base_linear.py ✅
│ │ │ │ └── test_lora_manager.py ✅
│ ├── models/ │ ├── models/
│ │ ├── qwen_image/ │ │ ├── qwen_image/ (e2e coverage)
│ │ ├── ovis_image/ │ │ ├── ovis_image/
│ │ │ └── ... │ │ │ └── test_ovis_image.py ✅
│ │ ├── z_image/ │ │ └── z_image/
│ │ └── ... │ │ └── test_zimage_tp_constraints.py ✅
│ └── worker/ │ └── worker/
│ ├── diffusion_worker.py │ └── test_diffusion_worker.py ✅ file at diffusion/test_diffusion_worker.py
│ └── diffusion_model_runner.py │
│
├── distributed/ → ├── distributed/
│ └── omni_connectors/ │ └── omni_connectors/
│ ├── adapter.py │ ├── test_adapter_and_flow.py ✅
│ ├── kv_transfer_manager.py │ ├── test_basic_connectors.py ✅
│ ├── connectors/ │ ├── test_kv_flow.py ✅
│ └── utils/ │ └── test_omni_connector_configs.py ✅
│
├── engine/ → ├── engine/
│ ├── input_processor.py │ ├── test_input_processor.py ⬜ (no processor.py in source)
│ ├── output_processor.py │ └── test_output_processor.py ⬜
│ └── arg_utils.py │ └── test_arg_utils.py ⬜
│
├── entrypoints/ → ├── entrypoints/
│ ├── stage_utils.py │ ├── test_stage_utils.py ✅
│ ├── cli/ │ ├── cli/ (benchmarks/test_serve_cli.py covers CLI serve)
│ │ └── ... │ │ └── test_*.py ⬜
│ └── openai/ │ └── openai_api/ # maps to entrypoints/openai/
│ ├── api_server.py │ ├── test_api_server.py ⬜ (e2e indirect coverage)
│ ├── serving_chat.py │ ├── test_serving_chat_sampling_params.py ✅
│ ├── serving_speech.py │ ├── test_serving_speech.py ✅
│ └── image_api_utils.py │ └── test_image_server.py ✅
│
├── inputs/ → ├── inputs/
│ ├── data.py │ ├── test_data.py ⬜
│ ├── parse.py │ ├── test_parse.py ⬜
│ └── preprocess.py │ └── test_preprocess.py ✅ currently in entrypoints/test_omni_input_preprocessor.py
│
├── model_executor/ → ├── model_executor/
│ ├── layers/ │ ├── layers/
│ │ └── mrope.py │ │ └── test_mrope.py ⬜
│ ├── model_loader/ │ ├── model_loader/
│ │ └── weight_utils.py │ │ └── test_weight_utils.py ⬜
│ ├── models/ │ ├── models/
│ │ ├── qwen2_5_omni/ │ │ ├── qwen2_5_omni/
│ │ │ ├── qwen2_5_omni_thinker.py │ │ │ ├── test_audio_length.py ✅
│ │ │ ├── qwen2_5_omni_talker.py │ │ │ ├── test_qwen2_5_omni_thinker.py ⬜
│ │ │ └── qwen2_5_omni_token2wav.py │ │ │ ├── test_qwen2_5_omni_talker.py ⬜
│ │ └── qwen3_omni/ │ │ │ └── test_qwen2_5_omni_token2wav.py ⬜
│ │ └── ... │ │ └── qwen3_omni/
│ ├── stage_configs/ │ │ └── test_*.py ⬜
│ │ └── *.yaml │ └── stage_configs/ (used by e2e, test_*.py can be added) ⬜
│ └── stage_input_processors/ │ └── stage_input_processors/
│ └── ... │ └── test_*.py ⬜
│
├── sample/ → ├── sample/
│ └── __init__.py │ └── test_*.py ⬜
│
├── utils/ → ├── utils/
│ └── __init__.py │ └── test_*.py ⬜ (no platform_utils.py currently)
│
├── worker/ → ├── worker/
│ ├── gpu_ar_model_runner.py │ ├── test_gpu_ar_model_runner.py ⬜
│ ├── gpu_ar_worker.py │ ├── test_gpu_ar_worker.py ⬜
│ ├── gpu_generation_model_runner.py │ ├── test_gpu_generation_model_runner.py ✅
│ ├── gpu_generation_worker.py │ ├── test_gpu_generation_worker.py ⬜
│ ├── gpu_model_runner.py │ ├── test_omni_gpu_model_runner.py ✅
│ └── mixins.py │ └── (npu under platforms/npu/worker/) # not worker/npu/
│
├── platforms/ → (no tests/platforms/, e2e and stage_configs provide indirect coverage)
│ ├── cuda/
│ ├── npu/worker/ # NPU worker here, not vllm_omni/worker/npu/
│ ├── rocm/
│ └── xpu/worker/
│
├── outputs.py → test_outputs.py ✅ (at tests root)
├── (logger, patch, request, version) → (no corresponding unit test)
│
└── e2e (tests side only) → ├── e2e/
├── online_serving/ ✅ non-empty
│ ├── test_qwen2_5_omni.py
│ ├── test_async_omni.py
│ ├── test_qwen3_omni.py
│ ├── test_qwen3_omni_expansion.py
│ ├── test_mimo_audio.py
│ └── test_images_generations_lora.py
└── offline_inference/ ✅
├── test_qwen2_5_omni.py
├── test_qwen3_omni.py
├── test_bagel_text2img.py
├── test_z_image.py
├── test_wan22.py
├── test_zimage_tensor_parallel.py
├── test_cache_dit.py
├── test_teacache.py
├── test_stable_audio_expansion.py
├── test_diffusion_cpu_offload.py
├── test_diffusion_layerwise_offload.py
├── test_diffusion_lora.py
├── test_sequence_parallel.py
└── stage_configs/ (legacy schema, still present for unmigrated models)
├── bagel_*.yaml
└── npu/, rocm/, etc.
# Migrated models (qwen3_omni_moe, qwen2_5_omni, qwen3_tts) live under
# vllm_omni/deploy/ instead — see docs/configuration/stage_configs.md.
Common Specifications
Before entering specific testing levels, the project establishes two common specifications aimed at standardizing the development process and quickly locating issues.
- PR Checklist (Tests Style): This template defines the self-check items that must be completed before submitting a code review (Pull Request). It ensures that each code change meets basic requirements such as code style, dependency updates, and documentation synchronization before entering the automated testing pipeline, serving as the first manual line of defense for quality assurance.
- CI Failure Explanation (CI Failures): This document archives and explains common failure patterns in the Continuous Integration (CI) pipeline, error log interpretation, and preliminary troubleshooting steps. It helps developers and testers quickly diagnose the causes of automated test failures, improving problem-solving efficiency.
Chapter 1: L1 & L2 Level Testing - Unit Testing and Basic End-to-End Verification
1.1 Testing Purpose
L1 and L2 level testing form the foundation of the quality assurance system. L1 level testing focuses on verifying the internal logic correctness of code units (e.g., functions, classes), ensuring each independent component behaves as designed.
L2 level testing builds upon L1 by introducing GPU resources and verifying that the end-to-end (E2E) process of the model in basic deployment scenarios is smooth. For example, it uses dummy models to confirm that core interfaces like the inference pipeline, output format, and streaming response work properly. The common goal of these two levels is to provide developers with rapid feedback, discovering and fixing issues early in the development cycle.
1.2 Testing Content and Scope
- L1 (Unit & Logic Testing):
  - Scope: Tests internal functions and methods of core components such as `entrypoints` and `models`.
  - Focus: Branch coverage, exception handling, algorithm logic correctness. Does not involve external dependencies or the complete service stack.
  - Time Cost: Execution time is kept within 15 minutes to ensure fast feedback.
- L2 (Basic End-to-End Testing):
  - Scope: Covers two basic deployment scenarios: `online` (serving) and `offline` (inference).
  - Focus: Uses `dummy` models or lightweight real models to verify that the entire chain from request input to result output works normally, including output data structure, streaming (stream) support, etc. Also includes some unit tests that require launching independent service instances.
  - Characteristic: Requires GPU resources to perform model computations.
1.3 Test Directory and Execution Files
A clear directory structure is key to managing test cases efficiently.
- L1 Test Directory: `/tests/{component_name}/test_xxx.py`
  - Here, `{component_name}` corresponds to modules in the source code, such as `distributed`, `entrypoints`, etc., and `test_xxx.py` is the specific test file.
- L2 Test Directory:
  - Online Serving: `/tests/e2e/online_serving/test_{model_name}.py`
  - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}.py`
1.4 Execution Method and Example
- Trigger Timing: `PR with ready label`. That is, when a developer adds a "ready for review" or similar label to a PR on platforms like GitHub, L1 and L2 tests are automatically triggered.
- Execution Environment: L1 uses a CPU environment; L2 requires a GPU environment.
- Script Example:
L1 Test Examples
Examples from `tests/model_executor/models/qwen2_5_omni/test_audio_length.py`
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import pytest

pytestmark = [pytest.mark.core_model, pytest.mark.cpu]


def test_resolve_max_mel_frames_default():
    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import resolve_max_mel_frames

    assert resolve_max_mel_frames(None, default=30000) == 30000
    assert resolve_max_mel_frames(None, default=6000) == 6000


def test_resolve_max_mel_frames_explicit():
    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import resolve_max_mel_frames

    # Explicit argument always wins over default
    assert resolve_max_mel_frames(123, default=30000) == 123
    assert resolve_max_mel_frames(6000, default=30000) == 6000
    assert resolve_max_mel_frames(0, default=30000) == 0


@pytest.mark.parametrize("repeats", [2, 4])
@pytest.mark.parametrize("code_len", [0, 1, 32768])
@pytest.mark.parametrize("max_mel_frames", [None, -1, 0, 1, 6000, 30000])
def test_cap_and_align_mel_length_no_mismatch(repeats, code_len, max_mel_frames):
    """Guard that any max_mel_frames yields a mel length aligned to repeats, and
    consistent with the truncated code length (prevents concat mismatch).
    """
    from vllm_omni.model_executor.models.qwen2_5_omni.audio_length import cap_and_align_mel_length

    target_code_len, target_mel_len = cap_and_align_mel_length(
        code_len=code_len,
        repeats=repeats,
        max_mel_frames=max_mel_frames,
    )
    assert isinstance(target_code_len, int)
    assert isinstance(target_mel_len, int)
    if code_len == 0:
        assert target_code_len == 0
        assert target_mel_len == 0
        return
    assert target_code_len >= 1
    assert target_mel_len >= repeats
    assert target_mel_len % repeats == 0
    assert target_mel_len == target_code_len * repeats
    assert target_code_len <= code_len
    if max_mel_frames is not None and int(max_mel_frames) > 0 and int(max_mel_frames) >= repeats:
        assert target_mel_len <= int(max_mel_frames)
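To make the invariants checked by the parametrized test easier to read, here is a stand-in function that satisfies the same assertions. It is only a sketch: `cap_and_align_sketch` is a hypothetical name, and the real `cap_and_align_mel_length` in `vllm_omni` may be implemented differently.

```python
from typing import Optional, Tuple


def cap_and_align_sketch(code_len: int, repeats: int, max_mel_frames: Optional[int]) -> Tuple[int, int]:
    """Return (code length, mel length) with mel length == code length * repeats."""
    if code_len <= 0:
        return 0, 0
    target_code_len = code_len
    if max_mel_frames is not None and max_mel_frames > 0:
        # Largest code length whose mel length fits under the cap, never below 1.
        target_code_len = min(code_len, max(1, max_mel_frames // repeats))
    return target_code_len, target_code_len * repeats


print(cap_and_align_sketch(32768, 2, 6000))
```

Because the mel length is always derived from the truncated code length, the two values cannot drift apart, which is the mismatch the test guards against.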
L2 Test Examples
You can refer to Test Examples in Chapter 2 to see example test cases that incorporate both L2 and L3 testing logic.
- Run Command:
pytest -s -v /tests/e2e/online_serving/test_{model_name}.py
pytest -s -v -m 'core_model and cpu' --run-level=core_model
Chapter 2: L3 Level Testing - Core Integration, Performance, and Accuracy Verification
2.1 Testing Purpose
L3 level testing executes after code is merged into the main branch. Its core purpose is to verify the integration effect, key performance indicators, and output accuracy of real models in multiple deployment scenarios. It acts as the "quality gatekeeper" for the main branch, ensuring that no merge breaks the core capabilities of the model service. Testing needs to provide clear conclusions within a relatively short time (<30min), balancing test depth with feedback speed.
2.2 Testing Content and Scope
- Deployment Scenarios: Covers richer `online` and `offline` deployment configurations, which may include different hardware configurations, batch sizes, concurrency levels, etc.
- Core Verification:
  - Inference Functionality: Ensures real models can perform forward computation normally and return results.
  - Accuracy Compliance: Verifies that the model's evaluation metrics (e.g., accuracy) meet the expected baseline, preventing code changes from introducing accuracy issues.
  - Important Performance: Verifies whether performance (e.g., P99 latency, throughput) in core scenarios meets preset thresholds.
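As an illustration of the performance check described above, the following sketch computes a P99 latency with the nearest-rank method and compares it to a preset threshold. The function names and the sample data are assumptions made for this example; the project's benchmark tooling may compute percentiles differently.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_threshold(latencies_ms: Sequence[float], threshold_ms: float, pct: float = 99.0) -> bool:
    """True when the chosen percentile latency does not exceed the threshold."""
    return percentile_nearest_rank(latencies_ms, pct) <= threshold_ms


latencies = [float(ms) for ms in range(1, 101)]  # synthetic latencies: 1..100 ms
print(percentile_nearest_rank(latencies, 99), meets_latency_threshold(latencies, 99.0))
```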
2.3 Test Directory and Execution Files
- Functional Testing:
  - Online Serving: `/tests/e2e/online_serving/test_{model_name}_expansion.py`
  - Offline Inference: `/tests/e2e/offline_inference/test_{model_name}_expansion.py`
  - (Note: `_expansion.py` likely means it contains more comprehensive scenario cases compared to L2 tests).
2.4 Execution Method and Example
- Trigger Timing: `PR Merged`. Automatically triggered after code review is approved and merged into the main branch.
- Execution Environment: GPU servers.
- Script Example:
Test Examples
2.4.1 Mark Declaration Section
@pytest.mark.advanced_model
@pytest.mark.core_model
@pytest.mark.parametrize("omni_server", test_params, indirect=True)
Explanation:
@pytest.mark.advanced_model: Marks the test as L3 merge level, indicating deep validation with real models.
@pytest.mark.full_model: Marks L4 nightly-only suites (e.g. test_*_expansion.py, doc examples).
@pytest.mark.core_model: Marks the test as L1 or L2 level, indicating that this test case validates the basic functionality of the core model. It uses mock weights and only checks if the relevant interface functions correctly.
@pytest.mark.parametrize: A parameterization decorator that allows abstracting test data into parameters, enabling reuse of the same test logic across different data configurations. indirect=True indicates that parameters will be passed to the fixture for processing.
Notes: If you believe the test case only needs to execute basic run logic at the PR-level CI, you can mark it only with @pytest.mark.core_model. If you believe it only needs to execute deep validation at merge (L3), use @pytest.mark.advanced_model. For L4 nightly-only expansion and doc-example tests, use @pytest.mark.full_model with --run-level full_model. If the test case needs both basic run and deep validation, mark with @pytest.mark.core_model and the appropriate L3/L4 marker (advanced_model and/or full_model).
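The effect of the marker choices described in the notes can be sketched in plain Python. This only imitates what selecting with `-m <marker>` achieves; it is not how pytest or the project's `conftest.py` implements selection. Apart from `test_mix_to_text_audio_001`, the test names below are hypothetical.

```python
from typing import Dict, List, Set

LEVEL_MARKERS = ("core_model", "advanced_model", "full_model")

# Marker sets per test; only the second entry mirrors a test shown in this chapter.
CASES: Dict[str, Set[str]] = {
    "test_basic_run": {"core_model"},
    "test_mix_to_text_audio_001": {"core_model", "advanced_model"},
    "test_feature_expansion": {"full_model", "diffusion"},
}


def cases_for(run_level: str) -> List[str]:
    """Return the tests that carry the marker of the requested run level."""
    if run_level not in LEVEL_MARKERS:
        raise ValueError(f"unknown run level: {run_level}")
    return sorted(name for name, marks in CASES.items() if run_level in marks)


for level in LEVEL_MARKERS:
    print(level, cases_for(level))
```

A test carrying both `core_model` and `advanced_model` is therefore picked up at the PR level and again at merge time.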
2.4.2 Test Function Definition and Documentation
def test_mix_to_text_audio_001(omni_server, openai_client) -> None:
    """
    Test multi-modal input processing and text/audio output generation via OpenAI API.
    Deploy Setting: default yaml
    Input Modal: text + audio + video + image
    Output Modal: text + audio
    Input Setting: stream=True
    Datasets: single request
    """
Explanation:
Function Naming Convention: Uses the test_ prefix, describes the test scenario mix_to_text_audio, and the number 001 indicates the first test case for this scenario.
Parameter Explanation:
omni_server: Omni server instance obtained via fixture, containing model information and configuration.
openai_client: Unified OpenAI client processing instance, encapsulating request sending and response validation logic.
Docstring: Describes the test purpose, deployment settings, input/output modalities, streaming settings, and dataset type in detail, providing clear context for test maintenance.
2.4.3 Multimodal Data Preparation
video_data_url = f"data:video/mp4;base64,{generate_synthetic_video(224, 224, 300)['base64']}"
image_data_url = f"data:image/jpeg;base64,{generate_synthetic_image(224, 224)['base64']}"
audio_data_url = f"data:audio/wav;base64,{generate_synthetic_audio(5, 1)['base64']}"
Explanation:
Data Generation Functions: Use the generate_synthetic_* series of functions to generate synthetic test data, avoiding reliance on external resources and ensuring test reproducibility and stability.
Parameter Explanation:
Video: width, height, duration_frames
Image: width, height
Audio: duration_seconds, channels
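For readers who want to see what such a generator might do internally, the sketch below builds a short sine-tone WAV in memory using only the standard library and returns it in the same `{"base64": ...}` shape used above. `make_synthetic_wav`, its sample rate, and its tone frequency are assumptions for illustration; the project's `generate_synthetic_audio` helper may work differently.

```python
import base64
import io
import math
import struct
import wave
from typing import Dict


def make_synthetic_wav(duration_seconds: float, channels: int,
                       sample_rate: int = 8000, tone_hz: float = 440.0) -> Dict[str, str]:
    """Render a 16-bit PCM sine tone and return it base64-encoded."""
    frame_count = int(duration_seconds * sample_rate)
    frames = bytearray()
    for n in range(frame_count):
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * tone_hz * n / sample_rate))
        # Write the same sample to every channel of this frame.
        frames += struct.pack("<h", sample) * channels
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return {"base64": base64.b64encode(buffer.getvalue()).decode("ascii")}


audio_data_url = f"data:audio/wav;base64,{make_synthetic_wav(1, 1)['base64']}"
print(audio_data_url[:40])
```

Generating the payload in memory keeps the test independent of files on disk and of network access.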
2.4.4 Request Configuration and Keyword Validation
request_config = {
    "model": omni_server.model,
    "messages": messages,
    "stream": True,
    "key_words": {
        "audio": ["water", "cricket"],
        "video": ["sphere", "globe", "circle", "round"],
        "image": ["square", "quadrate"],
        "text": ["beijing"]
    },
}
Explanation:
Model Specification: Uses omni_server.model to ensure the test aligns with the model configured on the server.
Keyword Validation Mechanism: This is an innovative design of the template to address the specific needs of multimodal testing:
Audio Keywords: Validate whether the generated text's description of audio content contains expected elements (e.g., "water" for water sounds, "cricket" for cricket sounds). If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
Video Keywords: Validate whether the generated text's description of video content contains expected elements. If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
Image Keywords: Validate whether the generated text's description of image content contains expected elements. If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
Text Keywords: Validate whether the generated text contains expected elements. If you provide multiple keywords, the validation is considered successful if at least one keyword is present.
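The "at least one keyword per modality" rule can be sketched as follows. This is an illustrative stand-in, not the client's actual validation code; `missing_modalities` is a hypothetical helper, and case-insensitive substring matching is an assumption.

```python
from typing import Dict, List


def missing_modalities(generated_text: str, key_words: Dict[str, List[str]]) -> List[str]:
    """Return the modalities for which none of the expected keywords appears."""
    lowered = generated_text.lower()
    return [
        modality
        for modality, words in key_words.items()
        if not any(word.lower() in lowered for word in words)
    ]


key_words = {
    "audio": ["water", "cricket"],
    "video": ["sphere", "globe", "circle", "round"],
    "image": ["square", "quadrate"],
    "text": ["beijing"],
}
answer = "I hear running water, see a rotating sphere and a red square. Beijing is the capital."
print(missing_modalities(answer, key_words))
```

An empty result means every modality matched at least one keyword, so the validation would pass.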
2.4.5 Request Execution
openai_client.send_omni_request(request_config, request_num=1) # for omni-understanding models
# or
openai_client.send_diffusion_request(request_config, request_num=1) # for diffusion models
Explanation:
Unified Client: Uses the OpenAIClientHandler instance to send requests. This client encapsulates error handling, retry mechanisms, and response validation logic.
Single Request: The comment clearly states this is a single-request completion test. For concurrent testing, it can be extended to multiple requests using request_num = n.
Implicit Validation: The send_omni_request and send_diffusion_request methods internally include validation logic that is selected dynamically based on the --run-level parameter: core_model performs basic validation, while advanced_model and full_model perform deep validation.
- Run Command (L3 merge):
pytest -s -v /tests/e2e/online_serving/test_{model_name}.py -m advanced_model --run-level=advanced_model
- Run Command (L4 nightly expansion):
pytest -s -v /tests/e2e/online_serving/test_{model_name}_expansion.py -m full_model --run-level=full_model
Chapter 3: L4 Level Testing - Full Functionality, Performance, and Documentation Testing
3.1 Testing Purpose
L4 level testing is a comprehensive quality audit before a version release. It expands upon L3, executing full functional scenarios, conducting systematic performance stress tests, and simultaneously verifying the correctness of accompanying example documentation. Its purpose is to perform deep validation of the system during off-peak nighttime hours, providing quality trend reports for daytime development and data support for release decisions.
3.2 Testing Content and Scope
- Full Functionality Testing: Executes all test cases defined in `test_{model_name}_expansion.py`, covering all implemented features, positive flows, boundary conditions, and exception handling.
- Performance Testing: Uses `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, and diffusion configs in the form `tests/dfx/perf/tests/test_*_vllm_omni.json` (passed to `run_benchmark.py` via `--test-config-file`) to drive performance testing tools for stress, load, and endurance tests, collecting metrics like throughput, response time, and resource utilization.
- Documentation Testing: Verifies whether the example code provided to users is runnable and whether its results match the description.
3.3 Test Directory and Execution Files
- Functional Testing: Same directories as L3.
- Performance Test Configuration: `tests/dfx/perf/tests/test_qwen_omni.json`, `tests/dfx/perf/tests/test_tts.json`, and diffusion configs `tests/dfx/perf/tests/test_*_vllm_omni.json` (e.g. `test_qwen_image_vllm_omni.json`)
- Documentation Example Tests:
  - `tests/example/online_serving/test_{model_name}.py`
  - `tests/example/offline_inference/test_{model_name}.py`
3.4 Execution Method and Example
- Trigger Timing: `Nightly`, automatically executed every night.
- Execution Environment: GPU server clusters to meet the resource demands of performance testing.
- Script Example:
Test Examples: Documentation Example Tests
Preferred Test Strategy
Use one of the following patterns depending on page type:
- Dynamic code-block extraction (preferred for offline docs)
  - Extract Python/Bash code blocks from the markdown with an AST analyzer, then execute them directly in tests.
  - Benefit: test logic stays automatically aligned with docs.
  - Basic idea: Use `ReadmeSnippet.extract_readme_snippets` to extract a list of code blocks into a module-level variable, use this list as `pytest.mark.parametrize` parameters, and pass each snippet item to `example_runner.run` inside the parametrized test. Additionally pass an `output_subfolder` argument for the 2nd-level output folder explained in Output Directory Structure below. If a test needs an extra environment variable (e.g., the example script reads it), `example_runner.run` also accepts a 3rd `env` parameter.
  - See tests/examples/offline_inference/test_text_to_image.py for reference implementation.
- Explicit copied scripts (used by online docs for now until further update)
  - For online serving pages, it is acceptable to copy code from docs into dedicated test functions, because only client-side, request-sending scripts are tested.
  - Rationale: dynamic extraction is overly complex here, because it would need to tell server-launch scripts apart from client-request scripts.
  - Requirement: copied test code must be kept in sync with doc updates.
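The dynamic extraction idea can be illustrated with a simplified, line-based extractor that collects fenced Python/Bash blocks and names each one after its enclosing H2 heading plus a running counter. This is a hypothetical sketch and not the project's `ReadmeSnippet` implementation (which is described as AST-based); `Snippet`, `slugify`, and `extract_snippets` are names invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Sequence

FENCE = "`" * 3  # built programmatically so this example can itself live in a fenced block


@dataclass
class Snippet:
    case_id: str
    lang: str
    code: str


def slugify(heading: str) -> str:
    """Turn 'Basic Usage' into 'basic_usage'."""
    return re.sub(r"[^a-z0-9]+", "_", heading.lower()).strip("_")


def extract_snippets(markdown: str, langs: Sequence[str] = ("python", "bash")) -> List[Snippet]:
    snippets: List[Snippet] = []
    counters: Dict[str, int] = {}
    heading = "untitled"
    lines = markdown.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("## "):
            heading = slugify(line[3:])
        elif line.startswith(FENCE):
            lang = line[len(FENCE):].strip()
            body: List[str] = []
            i += 1
            # Collect lines until the closing fence (or end of file).
            while i < len(lines) and not lines[i].startswith(FENCE):
                body.append(lines[i])
                i += 1
            if lang in langs:
                counters[heading] = counters.get(heading, 0) + 1
                case_id = f"{heading}_{counters[heading]:03d}"
                snippets.append(Snippet(case_id, lang, "\n".join(body)))
        i += 1
    return snippets


page = "\n".join(["## Basic Usage", FENCE + "python", "print(1)", FENCE])
print([s.case_id for s in extract_snippets(page)])
```

The resulting list could then be fed to `pytest.mark.parametrize`, using `case_id` as the parameter id so that test names follow the convention described in the next section.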
Test Case Naming Convention
- Dynamic code extraction (auto-generated internally): `test_{single_function_name_matching_file_name}[h2_heading_00X]`
  - Example: `test_text_to_image[basic_usage_001]`
- Explicit copied scripts: `test_{h2_heading_00X}[{dummy_param_id_for_omni_server}]`
  - Example: `test_api_calls_001[omni_server0]`
Runtime Configuration
In the example code tests, do not reduce num_inference_steps just to speed up the tests unless there is a strong CI reliability reason to do otherwise.
Skipping Rules
You may skip examples falling in the following categories using pytest.mark.skip or pytest.skip:
- Gradio UI scripts
- Scenarios that significantly overlap with existing tests and add little new coverage.
Output Directory Structure
Use a three-layer output structure to store output artifacts:
- Root output directory
  - Auto-detected from `OUTPUT_DIR` env var or auto-generated under `/tmp`.
- Doc-page directory
  - Define and use a clear page-level folder name in each `test_*.py` yourself (abbreviations are acceptable, e.g., `example_offline_t2i`).
- Test-case directory
  - Must match the case identifier (e.g., `basic_usage_001`).
  - Auto-generated for dynamically extracted tests.
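The three layers can be combined into a path as in the sketch below. `resolve_output_dir` is a hypothetical helper written for this example (the project's runner may resolve directories differently); the temporary-directory fallback relies on the platform's default temp location, which is commonly `/tmp` on Linux.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_output_dir(page_folder: str, case_id: str,
                       environ: Optional[Mapping[str, str]] = None) -> Path:
    """Build <root>/<doc-page folder>/<test-case id> and create it if needed."""
    env = os.environ if environ is None else environ
    # Layer 1: root from OUTPUT_DIR, otherwise a fresh temporary directory.
    root = env.get("OUTPUT_DIR") or tempfile.mkdtemp(prefix="example_outputs_")
    # Layers 2 and 3: page-level folder, then the case identifier.
    path = Path(root) / page_folder / case_id
    path.mkdir(parents=True, exist_ok=True)
    return path


print(resolve_output_dir("example_offline_t2i", "basic_usage_001"))
```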
Test Examples: Performance Tests
When you want to add L4-level performance test cases, you can refer to the following format for case addition in tests/dfx/perf/tests/test_qwen_omni.json, tests/dfx/perf/tests/test_tts.json, or diffusion configs such as tests/dfx/perf/tests/test_*_vllm_omni.json (selected via pytest ... run_benchmark.py --test-config-file <path>):
{
  "test_name": "test_qwen3_omni",
  "server_params": {
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "stage_config_name": "qwen3_omni.yaml"
  },
  "benchmark_params": [
    {
      "dataset_name": "random",
      "num_prompts": [10, 20],
      "max_concurrency": [1, 4],
      "random_input_len": 2500,
      "random_output_len": 900,
      "ignore_eos": true,
      "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration",
      "baseline": {
        "mean_ttft_ms": [500, 800],
        "mean_audio_ttfp_ms": [2000, 3500],
        "mean_audio_rtf": [0.25, 0.35]
      }
    }
  ]
}
Parameter Explanation
Overview
| Field | Required | Description |
|---|---|---|
| test_name | Yes | Unique identifier for the test case |
| server_params | Yes | Server-side configuration parameters |
| benchmark_params | Yes | Benchmark running parameters (supports multiple configurations) |
server_params Configuration
Basic Parameters
| Parameter | Required | Example | Description |
|---|---|---|---|
| model | Yes | "Qwen/Qwen3-Omni-30B-A3B-Instruct" | Model name or path |
| stage_config_name | Yes | "qwen3_omni.yaml" | Stage configuration file name |
Dynamic Configuration (update/delete)
Supports incremental modifications based on the basic configuration:
| Operation | Description |
|---|---|
| update | Update or add configuration items |
| delete | Delete specified configuration items |
Example:
"update": {
  "async_chunk": true, // Enable asynchronous chunk processing
  "stage_args": {
    "0": {
      "engine_args.custom_process_next_stage_input_func": "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk"
    }
  }
},
"delete": {
  "stage_args": {
    "2": ["custom_process_input_func"] // Delete this configuration for stage 2
  }
}
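One way to picture how `update` and `delete` could be applied to a base configuration is the sketch below. It is an assumption-laden illustration, not the benchmark runner's code: `apply_update` and `apply_delete` are hypothetical helpers, the base dictionary layout is invented, and treating dotted keys as nested paths is a guess based on the `engine_args.custom_process_next_stage_input_func` key in the example.

```python
import copy
from typing import Any, Dict


def apply_update(config: Dict[str, Any], update: Dict[str, Any]) -> None:
    """Merge `update` into `config` in place; dotted keys address nested mappings."""
    for key, value in update.items():
        head, _, rest = key.partition(".")
        if rest:
            apply_update(config.setdefault(head, {}), {rest: value})
        elif isinstance(value, dict):
            child = config.get(key)
            if not isinstance(child, dict):
                child = config[key] = {}
            apply_update(child, value)
        else:
            config[key] = copy.deepcopy(value)


def apply_delete(config: Dict[str, Any], delete: Dict[str, Any]) -> None:
    """Remove the listed keys; nested dicts in `delete` walk down the config."""
    for key, spec in delete.items():
        child = config.get(key)
        if not isinstance(child, dict):
            continue
        if isinstance(spec, dict):
            apply_delete(child, spec)
        else:
            for name in spec:
                child.pop(name, None)


# Hypothetical base configuration, for illustration only.
config = {
    "async_chunk": False,
    "stage_args": {
        "0": {"engine_args": {"max_num_seqs": 1}},
        "2": {"custom_process_input_func": "some.module.func", "engine_args": {}},
    },
}
func = "vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk"
apply_update(config, {
    "async_chunk": True,
    "stage_args": {"0": {"engine_args.custom_process_next_stage_input_func": func}},
})
apply_delete(config, {"stage_args": {"2": ["custom_process_input_func"]}})
print(config)
```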
benchmark_params Configuration
You can add any benchmark running parameters you need here. For all optional parameters, refer to the benchmark documentation. General modifications are as follows:
- Change the --xxx-xx-xx running parameters to xxx_xx_xx format and fill them as keys in the JSON file.
- For boolean variables in the running parameters, modify them to forms such as ignore_eos: true/false and fill them into the JSON file.
- Optionally add a `baseline` object (see Baseline thresholds below). If you omit `baseline` or leave it empty, the performance test still runs but does not assert metric thresholds from this field.
- The qps and concurrency modes are recommended to be mutually exclusive. For detailed explanations, see the table below:
| Parameter | Type | Required | Example/Values | Description |
|---|---|---|---|---|
| num_prompts | int / array | Yes | 10, [10, 20, 30] | Number of requests. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of qps or max_concurrency, e.g., [10,10,10]. If an array is used, its length must match the number of qps or max_concurrency. |
| request_rate | float / array | No | 0.5, [0.5, 1, inf] | Queries per second. Supports single values or arrays. If a single value is used, it will be automatically expanded to match the number of num_prompts, e.g., [1,1,1]. If an array is used, its length must match the number of num_prompts. |
| max_concurrency | int / array | No | 1, [1, 2, 3] | Maximum concurrent in-flight requests. Same array / expansion rules as request_rate (mutually exclusive with QPS mode). |
| baseline | object | No | see above | Optional per-metric thresholds; keys must match benchmark output fields. Scalar, list (per sweep step), or object (keyed by concurrency or QPS string). |
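The expansion rules in the table can be sketched as follows: scalars are broadcast to the sweep length, arrays must agree in length, and a baseline entry is looked up per sweep step. `broadcast`, `expand_sweep`, and `resolve_threshold` are hypothetical helpers for illustration; the actual runner may differ, for instance in how it treats a config with neither `request_rate` nor `max_concurrency`.

```python
from typing import Any, Dict, List


def broadcast(value: Any, length: int, name: str) -> List[Any]:
    """Expand a scalar to `length` entries, or check that an array has that length."""
    if isinstance(value, list):
        if len(value) != length:
            raise ValueError(f"{name} has {len(value)} entries, expected {length}")
        return list(value)
    return [value] * length


def expand_sweep(params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one dict per sweep step with num_prompts and the chosen load parameter."""
    if "request_rate" in params and "max_concurrency" in params:
        raise ValueError("use either request_rate (QPS mode) or max_concurrency, not both")
    values = {"num_prompts": params["num_prompts"]}
    for key in ("request_rate", "max_concurrency"):
        if key in params:
            values[key] = params[key]
    steps = max((len(v) for v in values.values() if isinstance(v, list)), default=1)
    columns = {key: broadcast(v, steps, key) for key, v in values.items()}
    return [{key: columns[key][i] for key in columns} for i in range(steps)]


def resolve_threshold(threshold: Any, step_index: int, load_value: Any) -> Any:
    """Pick the threshold for one step: object keyed by load, per-step list, or scalar."""
    if isinstance(threshold, dict):
        return threshold.get(str(load_value))
    if isinstance(threshold, list):
        return threshold[step_index]
    return threshold


sweep = expand_sweep({"num_prompts": [10, 20], "max_concurrency": [1, 4]})
for index, step in enumerate(sweep):
    print(step, resolve_threshold([500, 800], index, step["max_concurrency"]))
```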
Test Examples: Functionality Tests
Scope
For diffusion models, the L4 functionality test covers all or common diffusion features that are supported by the model, including several parallelism acceleration methods, CPU offloading, the TeaCache and Cache-DiT cache backends, and quantization methods.
Test Case Design
For a high priority model (currently listed in issue #1832), we use several test cases, each with multiple features turned on, so that each supported feature is tested in at least one test case. This is to relieve the GPU workload on the CI machine. The suggested test case combination is as follows:
- If the model can fit into 4 L4 GPUs (20GB RAM each), with quantization and tensor parallel always on:
  - (1 GPU) TeaCache + Layerwise CPU offloading + GGUF
  - (4 GPUs) CacheDiT + Ulysses=2 + TP=2 + VAE=2 + FP8
  - (4 GPUs) CacheDiT + Ring=2 + HSDP=2 + VAE=2 + GGUF
  - (4 GPUs) TeaCache + CFG=2 + TP=2 + VAE=2 + FP8
- Otherwise, consider a 2 H100 GPU environment (80GB RAM each) with the following tests:
  - (1 GPU) TeaCache + Layerwise CPU offloading + GGUF
  - (2 GPUs) CacheDiT + Ulysses=2 + FP8
  - (2 GPUs) CacheDiT + Ring=2 + GGUF
  - (2 GPUs) TeaCache + CFG=2 + FP8
  - (2 GPUs) CacheDiT + TP=2 + VAE=2 + FP8
  - (2 GPUs) CacheDiT + HSDP=2 + VAE=2 + GGUF
- If 2 H100 GPUs cannot handle the model either (e.g., HunyuanImage 3.0):
  - Still design tests and feature combinations that best fit real-world scenarios.
  - Do not include them in CI (or exclude them from the CI's filtering criteria). Instead, relevant PR authors are encouraged to run these tests locally.
- Fallback plan:
  - If the model does not support layerwise CPU offloading, use module-wise offloading in the corresponding test case instead.
  - If the model supports only a specific caching feature, use that feature in all test cases; if it supports none, remove the caching option from all test cases.
  - If the model supports only a specific quantization feature, use that feature in all test cases; if it supports none, remove the quantization option from all test cases.
  - If the model does not support some other feature, remove that feature from the test cases that use it. If the coverage of a modified test case then completely overlaps with that of other test cases, remove the test case.
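The last fallback rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the repository: the feature labels are plain strings copied from the 2 H100 list above, `prune_cases` is a hypothetical helper, and "completely overlaps" is read here as "is a subset of a single other case". It does not model the substitution rules (module-wise offloading, switching to the one supported cache or quantization backend).

```python
from typing import FrozenSet, Iterable, List

# Feature labels copied from the 2x H100 list above. They are illustrative
# strings only, not identifiers from the code base.
H100_CASES = [
    {"TeaCache", "Layerwise CPU offloading", "GGUF"},
    {"CacheDiT", "Ulysses=2", "FP8"},
    {"CacheDiT", "Ring=2", "GGUF"},
    {"TeaCache", "CFG=2", "FP8"},
    {"CacheDiT", "TP=2", "VAE=2", "FP8"},
    {"CacheDiT", "HSDP=2", "VAE=2", "GGUF"},
]


def prune_cases(
    cases: Iterable[Iterable[str]], supported: Iterable[str]
) -> List[FrozenSet[str]]:
    """Drop unsupported features, then drop cases covered by another case.

    Assumption: a case is redundant when its remaining features are a
    subset of one other case (or it duplicates an earlier case).
    """
    supported_set = frozenset(supported)
    trimmed = [frozenset(case) & supported_set for case in cases]
    kept = []
    for i, case in enumerate(trimmed):
        if not case:
            continue  # nothing left to test in this case
        redundant = any(
            case < other or (case == other and j < i)
            for j, other in enumerate(trimmed)
            if j != i
        )
        if not redundant:
            kept.append(case)
    return kept


# Example: a hypothetical model without Ring attention and GGUF support.
ALL_FEATURES = frozenset().union(*H100_CASES)
SUPPORTED = ALL_FEATURES - {"Ring=2", "GGUF"}
PRUNED = prune_cases(H100_CASES, SUPPORTED)
```

In this example the third case shrinks to CacheDiT alone, which the Ulysses case already covers, so it is dropped while every supported feature remains covered by at least one case.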
For a normal priority model, further reduce the number of test cases.
- Only write one or two test cases for the most common feature combinations.
- Authors can experiment to find which feature combination best balances output quality and performance. Alternatively, they can refer to the example code in the PR that added the model, or in a PR that added a feature (if that code involves the model in question).
Currently, all of these features are available in online serving mode, so you only need to add `tests/e2e/online_serving/test_{model}_expansion.py`.
Code Style
- Validation: test that the multimodal output files of your model have the correct shapes. `OpenAIClientHandler.send_diffusion_request` should have taken care of this.
- Test marks: always add `full_model` and `diffusion` for L4 nightly `test_*_expansion.py` cases. Add GPU-related marks if needed. Ref: Markers for Tests.
- To maximize code reuse, you may refer to
  - `tests/conftest.py` for `omni_server` (running server in subprocess) and `openai_client` fixtures (sending requests and validating output), `generate_synthetic_image` and `assert_XXX_valid` helper.
  - `tests/helpers/mark.py` for `@hardware_test(...)` and `hardware_marks`.
  - Parametrizing tests (pytest doc) to reuse test function implementation for different cases.
- Doc: add a concise docstring for each test function.
- Reference L4 test implementation: `tests/e2e/online_serving/test_qwen_image_edit_expansion.py`.
- Run Command: (Specific commands would depend on the performance testing tool and configuration defined in `nightly.json`).
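To make the "correct shapes" validation above concrete, here is a small standard-library sketch of what a shape check on an image output amounts to. It is not the repository's implementation: `png_size` and `make_gray_png` are hypothetical helpers written for this example, and the project's own fixtures and `assert_XXX_valid` helpers should be preferred in real tests.

```python
import struct
import zlib
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Layout after the signature: 4-byte length, 4-byte type, then payload.
    if data[12:16] != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, payload, CRC over type+payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def make_gray_png(width: int, height: int) -> bytes:
    """Create a tiny 8-bit grayscale PNG to stand in for a model output."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0) followed by `width` gray pixels.
    raw = (b"\x00" + b"\x80" * width) * height
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


def assert_png_shape(data: bytes, expected: Tuple[int, int]) -> None:
    """Fail with a readable message when the image size is not as requested."""
    actual = png_size(data)
    assert actual == expected, f"expected {expected}, got {actual}"
```

A test would pass the bytes returned by the server together with the requested width and height; the synthetic PNG builder exists only so the sketch can run without a model.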
Chapter 4: L5 Level Testing - Stability and Reliability Testing¶
4.1 Testing Purpose¶
L5 level testing focuses on the performance of model services under long-running and abnormal fault scenarios. It aims to uncover deep-seated issues that only manifest under sustained pressure or extreme conditions, such as memory leaks, resource contention, gradual performance degradation, and lack of fault tolerance mechanisms. This is the final, yet crucial, line of defense for ensuring service high availability and production environment robustness.
4.2 Testing Content and Scope¶
- Long-term Stability (Stability) Testing: Uses JSON under `tests/dfx/stability/tests/` (for example `test_qwen3_omni.json` and `test_wan22.json`) to run the service under moderate load for an extended period (e.g., over 12 hours), monitoring whether metrics like memory/VRAM usage, response time, and throughput degrade over time, and whether the service process remains stable.
- Reliability Testing: Uses pytest suites under `tests/dfx/reliability/` to inject controlled faults against a live `vllm_omni serve` instance (same `omni_server` / `omni_server_function` fixture style as E2E). Current suites emphasize GPU memory pressure (CUDA sidecar “memory hog”), worker / runtime process kill (`SIGKILL` on `VLLM::Worker` for Qwen3-Omni or `multiprocessing.spawn` for Wan2.2 video workers), large multimodal chat or `/v1/videos` jobs under OOM, `/health` → 503 and fast-fail / non-hanging concurrent requests after kill, and OpenAI-style 5xx error contracts (e.g. text vs text+audio under OOM). Post-fault recovery checks exist where enabled (some cases may be `skip` while issues are tracked). See the Reliability `<details>` block in Section 4.4 for file-level responsibilities and CI markers (`slow`, `hardware_test`, POSIX-only kill).
4.3 Test Directory and Execution Files¶
- Stability Test Configuration: `tests/dfx/stability/tests/test_qwen3_omni.json`, `tests/dfx/stability/tests/test_wan22.json` (one JSON per model / runner family)
- Reliability Test Suite (`tests/dfx/reliability/`):
  - `test_reliability_qwen3_omni.py`: Qwen3-Omni chat / multimodal reliability (GPU OOM, process kill, recovery, error contract under `--async-chunk` vs default).
  - `test_reliability_wan22.py`: Wan2.2 T2V video API reliability (`/v1/videos` under OOM and process kill, recovery).
  - `helpers.py`: Shared primitives used by current suites: raw HTTP probes for `/v1/chat/completions` and `/health`, OpenAI-style error parsing, GPU OOM sidecar (`inject_gpu_oom` / `stop_gpu_oom_hogs`), and `pgrep`-based process-kill injector construction (`make_process_kill_fault_injector`).
  - `conftest.py`: `fault_injector` and `omni_server_after_fault` / `omni_server_after_fault_function` fixtures to run a callable after the server is ready.
  - `README.md`: Short local run commands for this directory.
4.4 Execution Method and Example¶
- Trigger Timing: `Weekly` (weekly) or `Days before Release` (several days before a major release). Due to long execution times, the frequency is lower.
- Execution Environment: GPU servers, requiring a stable and exclusive testing environment.
- Script Example:
Test Examples
When you want to add L5-level stability test cases, add or extend the appropriate JSON file under `tests/dfx/stability/tests/` (for example `test_qwen3_omni.json` for Omni bench traffic, or `test_wan22.json` for diffusion `/v1/videos` workloads). The following illustrates the Qwen3-Omni shape:

    {
      "test_name": "test_qwen3_omni_stability",
      "server_params": {
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "stage_config_name": "qwen3_omni.yaml"
      },
      "benchmark_params": [
        {
          "dataset_name": "random",
          "backend": "openai-chat-omni",
          "endpoint": "/v1/chat/completions",
          "duration_sec": 43200,
          "request_rate": 0.5,
          "num_prompts_per_batch": 20,
          "random_input_len": 2500,
          "random_output_len": 900,
          "ignore_eos": true,
          "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration"
        }
      ]
    }
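As a rough sanity check before committing such a configuration, the numbers can be cross-checked with a few lines of Python. This sketch assumes that `request_rate` is expressed in requests per second and that `percentile-metrics` is a comma-separated list; neither is confirmed by this document, and `summarize` is a hypothetical helper, not part of the stability runner.

```python
import json

# The Qwen3-Omni example from above, embedded so the sketch is self-contained.
CONFIG_TEXT = """
{
  "test_name": "test_qwen3_omni_stability",
  "server_params": {
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "stage_config_name": "qwen3_omni.yaml"
  },
  "benchmark_params": [
    {
      "dataset_name": "random",
      "backend": "openai-chat-omni",
      "endpoint": "/v1/chat/completions",
      "duration_sec": 43200,
      "request_rate": 0.5,
      "num_prompts_per_batch": 20,
      "random_input_len": 2500,
      "random_output_len": 900,
      "ignore_eos": true,
      "percentile-metrics": "ttft,tpot,itl,e2el,audio_rtf,audio_ttfp,audio_duration"
    }
  ]
}
"""


def summarize(config: dict) -> list:
    """Derive run length and approximate request volume per benchmark entry."""
    summaries = []
    for bench in config["benchmark_params"]:
        summaries.append(
            {
                "endpoint": bench["endpoint"],
                "hours": bench["duration_sec"] / 3600,
                # Assumption: request_rate is in requests per second.
                "approx_requests": bench["request_rate"] * bench["duration_sec"],
                "metrics": bench["percentile-metrics"].split(","),
            }
        )
    return summaries


SUMMARY = summarize(json.loads(CONFIG_TEXT))
```

Under those assumptions the example describes a 12-hour run issuing on the order of 21,600 requests, which matches the "over 12 hours" target stated in Section 4.2.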
Reliability test suite (tests/dfx/reliability)
#### Purpose and relationship to stability

Reliability tests are **short fault-injection** integration runs (L5 **(b)** in `tests/dfx/reliability/README.md`). They complement **stability** JSON-driven long runs: instead of hours of steady traffic, they **perturb** the server (GPU OOM sidecar, fatal signals on selected processes) and check **failure mode** and **latency bounds** (e.g. chat or `/v1/videos` must not hang under concurrent fault-time load).

#### Directory layout

| Path | Responsibility |
| ---- | -------------- |
| `helpers.py` | Shared helpers used by current reliability suites: raw `POST`/`GET` probes (`/v1/chat/completions`, `/health`), OpenAI error parsing (`extract_openai_error_contract_from_bytes`), GPU OOM sidecar lifecycle (`inject_gpu_oom`, `stop_gpu_oom_hogs`), and process-kill injector builder (`make_process_kill_fault_injector`). |
| `conftest.py` | Pytest fixtures: indirect `fault_injector`, `omni_server_after_fault` / `omni_server_after_fault_function` (run injector after server is ready, then yield server). |
| `test_reliability_qwen3_omni.py` | Qwen3-Omni: OOM vs **text vs text+audio** error contract, large multimodal chat under OOM, concurrent pressure, **SIGKILL** on `VLLM::Worker`, `/health` → 503 + fast-fail + concurrent chat; optional OOM recovery scenario (may be skipped while tracked in issues). |
| `test_reliability_wan22.py` | Wan2.2 T2V: large `/v1/videos` under OOM, **SIGKILL** on `multiprocessing.spawn` chain, health / fast-fail / concurrent video requests; optional recovery test (may be skipped). |
| `README.md` | Minimal run / collect examples. |

#### Parametrization and markers

- Each test module defines a **`RELIABILITY_SCENARIOS`** list (`test_name`, `server_params`: model, `stage_config_name` or diffusion `server_args`, etc.). **`create_reliability_omni_server_params()`** in `tests/dfx/conftest.py` resolves stage paths (including XPU substitutions where applicable) and builds **`OmniServerParams`** lists consumed by **`@pytest.mark.parametrize(..., indirect=True)`** on `omni_server` or `omni_server_function`.
- Cases are tagged **`@pytest.mark.slow`** for weekly / selective CI. GPU-heavy suites use **`@hardware_test(res={"cuda": "H100"}, num_cards=...)`** (Qwen3-Omni paths often require **2** cards; Wan2.2 video paths **1** card).
- **Process-kill** tests use **`@pytest.mark.skipif(os.name == "nt", ...)`** because injection uses POSIX **`pgrep` / `kill`**.

#### CI trigger

Weekly Buildkite (`.buildkite/test-weekly.yml`) runs, for example:

pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m "slow"
pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m "slow"
pytest --collect-only tests/dfx/reliability
pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m slow
pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m slow
- Stability: `pytest -s -v tests/dfx/stability/scripts/test_stability_qwen3_omni.py` or `pytest -s -v tests/dfx/stability/scripts/test_stability_wan22.py` (or add `test_stability_<model>.py` alongside a matching JSON config)
- Reliability: `pytest -s -v tests/dfx/reliability/test_reliability_qwen3_omni.py -m slow` and/or `pytest -s -v tests/dfx/reliability/test_reliability_wan22.py -m slow` (add `test_reliability_<suite>.py` for new models)
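The reliability checks described in this chapter expect that, after a fault, `/health` returns 503 and requests fail fast rather than hang. The following standard-library sketch shows the shape of such a check against a throwaway local server. It is not the code in `tests/dfx/reliability/helpers.py`: `probe_health`, the fake server, and the 2-second bound are all hypothetical stand-ins chosen for this example.

```python
import http.server
import threading
import time
import urllib.error
import urllib.request
from typing import Tuple

# Bypass any proxy settings from the environment so localhost is hit directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_health(base_url: str, timeout_s: float = 2.0) -> Tuple[int, float]:
    """GET {base_url}/health and return (status code, elapsed seconds)."""
    start = time.monotonic()
    try:
        with _OPENER.open(f"{base_url}/health", timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # urllib reports 4xx/5xx responses as HTTPError
    return status, time.monotonic() - start


class _UnhealthyHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a server whose worker has been killed."""

    def do_GET(self) -> None:
        self.send_response(503)
        self.end_headers()

    def log_message(self, *args) -> None:
        pass  # keep test output quiet


def start_fake_unhealthy_server() -> http.server.HTTPServer:
    """Serve 503 on every GET from a background thread on a free port."""
    server = http.server.HTTPServer(("127.0.0.1", 0), _UnhealthyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real suite the base URL would come from the server fixture after the fault injector has run, and the latency bound would be chosen to match the failure mode under test.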
Summary¶
This multi-level testing system achieves continuous, progressive validation of model service quality by tightly integrating testing activities with the development workflow (commit, review, merge, release). From rapid unit testing to comprehensive end-to-end testing, and further to in-depth performance, stability, and reliability verification, each level has clear objectives, collectively building a robust quality protection net. By following this system, teams can deliver high-quality, highly reliable model services more efficiently.