Production Metrics
vLLM-Omni exposes Prometheus metrics via the /metrics endpoint on the OpenAI-compatible API server. This page covers the text and audio surface; diffusion / image / video metrics are tracked in a follow-up PR.

    vllm-omni serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8000 --log-stats
    curl http://localhost:8000/metrics

`--log-stats` is required to populate metric data. Without the flag, the endpoint still returns 200 OK, but the wrapper around the upstream `vllm:*` metrics is not registered at all, and the 15 `vllm:omni_*` families are registered as placeholders that never receive data. This keeps the runtime cost essentially zero for deployments that don't need monitoring. With the flag, all ~80 families are populated.
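One way to check whether `--log-stats` took effect is to count the samples each family carries in the scraped text. The sketch below uses only the standard library. It assumes the default Prometheus text exposition format, and it assumes that placeholder families appear as `# TYPE` lines with no sample lines, which is how `prometheus_client` renders labelled metrics that were never observed. The helper names (`fetch_metrics`, `count_samples_per_family`, `empty_families`) are illustrative and not part of vLLM-Omni.

```python
import re
from urllib.request import urlopen

# Matches "name{labels} value" or "name value" sample lines.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+\S+")


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5.0):
    """Scrape the endpoint once and return the exposition text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def count_samples_per_family(text):
    """Map each family declared by a '# TYPE' line to its sample count."""
    counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) >= 4:
                counts.setdefault(parts[2], 0)
            continue
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if not match:
            continue
        name = match.group(1)
        family = name
        if family not in counts:
            # Histogram samples carry a suffix on top of the family name.
            for suffix in ("_bucket", "_sum", "_count"):
                if name.endswith(suffix) and name[: -len(suffix)] in counts:
                    family = name[: -len(suffix)]
                    break
        counts[family] = counts.get(family, 0) + 1
    return counts


def empty_families(text, prefix="vllm:omni_"):
    """Families under `prefix` that are registered but hold no samples."""
    counts = count_samples_per_family(text)
    return sorted(n for n, c in counts.items() if n.startswith(prefix) and c == 0)
```

Calling `empty_families(fetch_metrics())` against a running server should return an empty list once `--log-stats` is set and traffic has reached the relevant stages. Families for stages the pipeline does not have may still show up as empty.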
Metric Namespaces
| Prefix | Source | Present when |
|---|---|---|
| `vllm:omni_` | vLLM-Omni orchestrator / audio modality / cross-stage transfer | Pipeline-dependent |
| `vllm:` | Upstream vLLM engine, wrapped by `OmniPrometheusStatLogger` to expose `{stage, replica}` | Pipeline includes an LLM (AR) stage |
| `http_` / `process_` | Uvicorn / Python runtime | Always |
Pipeline-Level Metrics (vllm:omni_)
Defined in vllm_omni/metrics/prometheus.py. Track request lifecycle across the full multi-stage pipeline.
Request counts
| Metric | Type | Labels | Description |
|---|---|---|---|
| `vllm:omni_num_requests_running` | Gauge | `model_name` | Pipeline-global in-flight requests (dispatched to engine, not yet finalized) |
| `vllm:omni_num_requests_waiting` | Gauge | `model_name` | Requests waiting in the Orchestrator queue |
| `vllm:omni_requests_success_total` | Counter | `model_name`, `finished_reason` | Total requests by completion reason. `finished_reason` ∈ {stop, length, abort, ...} mirroring upstream `vllm:request_success_total`; aborts cover client disconnect / cancellation paths in addition to upstream `FinishReason.ABORT` |
Latency
| Metric | Type | Labels | Description |
|---|---|---|---|
| `vllm:omni_e2e_request_latency_s` | Histogram | `model_name` | Pipeline-global end-to-end request latency in seconds |
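Because the latency family is a histogram, percentiles have to be estimated from cumulative bucket counts. The sketch below reproduces the linear interpolation that Prometheus' `histogram_quantile()` applies, as a way to sanity-check dashboard numbers offline. The bucket bounds in the usage note are made up for illustration and are not the real `SECONDS_BUCKETS` values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must include the +Inf bucket, as a Prometheus histogram does.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, as Prometheus does.
                return ordered[-2][0]
            if count == prev_count:
                # Guards q == 0 with an empty first bucket.
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
```

For example, with hypothetical buckets `[(0.5, 10), (1.0, 60), (2.0, 90), (float("inf"), 100)]`, the median lands in the 0.5 to 1.0 s bucket and the 95th percentile is clamped to the highest finite bound, so the choice of bucket bounds limits how precisely tail latency can be reported.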
Audio Modality Metrics (vllm:omni_)
Emitted at request finalize, except for `audio_ttfp_s` (streaming hook at the first audio packet) and `audio_underrun_s` / `audio_continuity_ok_total` (streaming finalize, after the chunk stream is exhausted). All carry `{model_name, stage, replica}` plus the listed extra label.
| Metric | Type | Extra label | Description |
|---|---|---|---|
| `vllm:omni_audio_ttfp_s` | Histogram | - | Time from request arrival to first audio packet/frame |
| `vllm:omni_audio_duration_s` | Histogram | - | Audio content duration (`audio_frames / sample_rate`) |
| `vllm:omni_audio_rtf` | Histogram | - | Real-time factor (`stage_gen_time_s / audio_duration_s`); values below 1 mean audio is generated faster than it plays back, so the streaming TTS SLO red line is RTF < 1; uses `RTF_BUCKETS` |
| `vllm:omni_audio_frames_total` | Counter | - | Cumulative audio frame count; throughput via `rate()` |
| `vllm:omni_audio_underrun_s` | Histogram | - | Per-request worst-case player deficit; > 0 indicates the listener heard silent gaps |
| `vllm:omni_audio_continuity_ok_total` | Counter | `threshold_ms` | Incremented when the request's worst underrun stayed below `threshold_ms` |
| `vllm:omni_audio_skipped_requests_total` | Counter | `reason` | Silent-loss counter: code2wav rejected malformed codec input and returned 200 OK with empty audio |
The continuity math comes from vllm_omni/benchmarks/audio_continuity.py::compute_continuity_stats so the server-side observation aligns with the bench-side definition.
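The arithmetic behind the duration, RTF, and continuity families can be sketched as follows. The duration and RTF formulas come straight from the table. The underrun model is an assumption made for illustration (playback starts when the first chunk arrives and the playhead never pauses); it is not a transcription of `compute_continuity_stats`, and the function names and thresholds are hypothetical.

```python
def audio_duration_s(audio_frames, sample_rate):
    """Audio content duration: audio_frames / sample_rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return audio_frames / sample_rate


def audio_rtf(stage_gen_time_s, audio_frames, sample_rate):
    """Real-time factor: generation time divided by audio duration."""
    duration = audio_duration_s(audio_frames, sample_rate)
    if duration == 0:
        raise ValueError("cannot compute RTF for empty audio")
    return stage_gen_time_s / duration


def worst_underrun_s(arrival_times_s, chunk_durations_s):
    """Worst player deficit under the assumed (simplified) player model.

    A chunk is late when it arrives after the moment the player needs it.
    """
    if len(arrival_times_s) != len(chunk_durations_s):
        raise ValueError("need one duration per arrival time")
    if not arrival_times_s:
        return 0.0
    playhead = arrival_times_s[0]  # playback starts at the first chunk
    worst = 0.0
    for arrival, duration in zip(arrival_times_s, chunk_durations_s):
        worst = max(worst, arrival - playhead)
        playhead += duration  # when the next chunk will be needed
    return worst


def continuity_ok(worst_underrun, thresholds_ms):
    """Which threshold_ms label values would be incremented for a request."""
    return {t: worst_underrun * 1000.0 < t for t in thresholds_ms}
```

With 48,000 frames at 24 kHz generated in 1.0 s, the duration is 2.0 s and the RTF is 0.5, comfortably under the red line of 1. A chunk that arrives 0.1 s after the player needs it would, under this model, fail a 50 ms threshold and pass a 200 ms one.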
Cross-Stage Transfer Metrics (vllm:omni_)
Per-physical-transfer histograms tracking the data hop between adjacent stages. Labels {model_name, from_stage, from_replica, to_stage, to_replica} let dashboards attribute latency to specific replica edges. from_replica / to_replica are resolved from the orchestrator's sticky-routing binding (stage_pool.get_bound_replica_id(request_id)), so no extra plumbing through TransferEdgeStats is needed.
| Metric | Type | Description |
|---|---|---|
| `vllm:omni_transfer_size_bytes` | Histogram | Per-transfer payload size in bytes |
| `vllm:omni_transfer_tx_s` | Histogram | Sender-side time (serialize + submit to connector) |
| `vllm:omni_transfer_rx_s` | Histogram | Receiver-side time (recv + deserialize) |
| `vllm:omni_transfer_in_flight_s` | Histogram | Network in-flight time (TX done → RX recv start) |
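The three timing families partition a single hop into consecutive phases. The sketch below shows that decomposition from four timestamps. `TransferTimeline` and its field names are hypothetical and do not mirror `TransferEdgeStats`; the sketch also assumes sender and receiver timestamps come from comparable clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferTimeline:
    """Hypothetical timestamps (seconds) for one cross-stage transfer."""
    tx_start_s: float       # sender begins serializing
    tx_done_s: float        # sender has submitted to the connector
    rx_recv_start_s: float  # receiver begins receiving
    rx_done_s: float        # receiver has finished deserializing
    size_bytes: int


def transfer_observations(t):
    """Values that would be observed into each transfer histogram."""
    return {
        "vllm:omni_transfer_size_bytes": float(t.size_bytes),
        "vllm:omni_transfer_tx_s": t.tx_done_s - t.tx_start_s,
        "vllm:omni_transfer_in_flight_s": t.rx_recv_start_s - t.tx_done_s,
        "vllm:omni_transfer_rx_s": t.rx_done_s - t.rx_recv_start_s,
    }


def hop_latency_s(observations):
    """Total hop time: sender phase + in-flight phase + receiver phase."""
    return (
        observations["vllm:omni_transfer_tx_s"]
        + observations["vllm:omni_transfer_in_flight_s"]
        + observations["vllm:omni_transfer_rx_s"]
    )
```

Splitting the hop this way lets a dashboard tell a slow serializer (`tx_s`) apart from a congested link (`in_flight_s`) or a slow consumer (`rx_s`) on the same replica edge.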
vLLM Engine Metrics (vllm:)
When the pipeline includes an LLM stage, the upstream vLLM engine exposes its full set of ~37 metric families under the vllm: prefix.
vLLM-Omni wraps the upstream `vllm.v1.metrics.loggers.PrometheusStatLogger` with `OmniPrometheusStatLogger` so that the original single `engine` label is reshaped into `stage` + `replica`. Every `vllm:*` family (TTFT, ITL, TPOT, e2e latency, KV cache usage, scheduler running/waiting, request success counts, etc.) therefore gains per-(stage, replica) visibility automatically. No omni-side duplicate is needed for the text path.

    # Before wrap:
    vllm:num_requests_running{model_name="...", engine="1"} 3.0
    # After wrap:
    vllm:num_requests_running{model_name="...", stage="1", replica="0"} 2.0
    vllm:num_requests_running{model_name="...", stage="1", replica="1"} 1.0

For the full list of upstream metrics, see the vLLM docs.
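Consumers that only care about per-stage totals can fold the `replica` label back in. The sketch below sums the samples of one family by `stage` from raw exposition text, using the standard library only. `sum_by_stage` is an illustrative helper, not part of vLLM-Omni.

```python
import re
from collections import defaultdict

# "name{labels} value" sample lines; unlabelled samples are ignored here.
_SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)"
)
# key="value" pairs, allowing escaped characters inside the value.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_stage(text, metric):
    """Sum all samples of `metric` per `stage` label, across replicas."""
    totals = defaultdict(float)
    for raw in text.splitlines():
        match = _SAMPLE_RE.match(raw.strip())
        if not match or match.group("name") != metric:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        if "stage" in labels:
            totals[labels["stage"]] += float(match.group("value"))
    return dict(totals)
```

Applied to the two wrapped samples shown above, this yields a total of 3.0 for stage `1`. In PromQL the same aggregation is `sum by (stage) (vllm:num_requests_running)`.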
Metric Availability by Pipeline Type
| Metric group | Multi-stage LLM (Qwen3-Omni) |
|---|---|
| `vllm:omni_` request tracking + latency | With `--log-stats` |
| `vllm:omni_` audio modality | With `--log-stats`, if pipeline has a talker stage |
| `vllm:omni_` transfer | With `--log-stats`, if pipeline has ≥ 2 stages |
| `vllm:` engine metrics (per (stage, replica)) | With `--log-stats` |
| `vllm:` MFU metrics | With `--log-stats --enable-mfu-metrics` |
Naming Convention
- All time-bearing metrics use the `_s` suffix (values in seconds). Buckets are `SECONDS_BUCKETS` for e2e / generation-style values and `SECONDS_FAST_BUCKETS` (1 ms → 60 s) for the fine-grained transfer and audio-underrun values.
- Counters use the `_total` suffix (auto-appended by `prometheus_client`).
- Sizes use the `_bytes` suffix.
- All omni-specific families are prefixed `vllm:omni_`. The upstream `unregister_vllm_metrics()` function is monkey-patched (see `vllm_omni/patch.py`) to a scoped version that still strips upstream `vllm:*` collectors so multi-engine init within one process does not crash on duplicate registration, but preserves anything prefixed `vllm:omni_` / `vllm_omni`.
- Text and audio first-output use distinct families (`vllm:time_to_first_token_seconds` reused from upstream for text; `vllm:omni_audio_ttfp_s` for audio) rather than a single metric with a `modality` label.
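The scoping rule for the patched unregister step can be pictured as a simple name filter. This is a sketch of the rule as described above, not the code in `vllm_omni/patch.py`. It assumes the upstream helper selects collectors whose name contains `vllm`; `should_unregister` and `PRESERVED_PREFIXES` are hypothetical names.

```python
# Prefixes the scoped version is described as preserving.
PRESERVED_PREFIXES = ("vllm:omni_", "vllm_omni")


def should_unregister(collector_name):
    """Return True if a collector would be stripped before re-registration.

    Assumption: upstream matches any collector whose name contains "vllm";
    the scoped variant additionally spares the omni-specific prefixes.
    """
    if collector_name.startswith(PRESERVED_PREFIXES):
        return False
    return "vllm" in collector_name


def split_collectors(names):
    """Partition collector names into (stripped, kept) lists."""
    stripped = [n for n in names if should_unregister(n)]
    kept = [n for n in names if not should_unregister(n)]
    return stripped, kept
```

Under this rule `vllm:num_requests_running` is stripped and re-registered by the next engine init, while `vllm:omni_audio_rtf` and runtime families such as `process_cpu_seconds` are left in place.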