Skip to content

Production Metrics

vLLM-Omni exposes Prometheus metrics via the /metrics endpoint on the OpenAI-compatible API server. This page covers the text and audio surface; diffusion / image / video metrics are tracked in a follow-up PR.

vllm-omni serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8000 --log-stats
curl http://localhost:8000/metrics

--log-stats is required to populate metric data. Without the flag, the endpoint still returns 200 OK but the upstream vllm:* wrap is not registered at all, and the 15 vllm:omni_* families are registered as placeholders with no data written to them. This keeps the runtime cost essentially zero for deployments that don't need monitoring. With the flag, all ~80 families populate.

Metric Namespaces

Prefix Source Present when
vllm:omni_ vLLM-Omni orchestrator / audio modality / cross-stage transfer Pipeline-dependent
vllm: Upstream vLLM engine, wrapped by OmniPrometheusStatLogger to expose {stage, replica} Pipeline includes an LLM (AR) stage
http_ / process_ Uvicorn / Python runtime Always

Pipeline-Level Metrics (vllm:omni_)

Defined in vllm_omni/metrics/prometheus.py. Track request lifecycle across the full multi-stage pipeline.

Request counts

Metric Type Labels Description
vllm:omni_num_requests_running Gauge model_name Pipeline-global in-flight requests (dispatched to engine, not yet finalized)
vllm:omni_num_requests_waiting Gauge model_name Requests waiting in the Orchestrator queue
vllm:omni_requests_success_total Counter model_name, finished_reason Total requests by completion reason. finished_reason ∈ {stop, length, abort, ...} mirroring upstream vllm:request_success_total; aborts cover client disconnect / cancellation paths in addition to upstream FinishReason.ABORT

Latency

Metric Type Labels Description
vllm:omni_e2e_request_latency_s Histogram model_name Pipeline-global end-to-end request latency in seconds

Audio Modality Metrics (vllm:omni_)

Emitted at request finalize, except for audio_ttfp_s (streaming-hook at the first audio packet) and audio_underrun_s / audio_continuity_ok_total (streaming finalize, after the chunk stream is exhausted). All carry {model_name, stage, replica} plus the listed extra label.

Metric Type Extra label Description
vllm:omni_audio_ttfp_s Histogram Time from request arrival to first audio packet/frame
vllm:omni_audio_duration_s Histogram Audio content duration (audio_frames / sample_rate)
vllm:omni_audio_rtf Histogram Real-time factor (stage_gen_time_s / audio_duration_s); streaming TTS SLO red line < 1; uses RTF_BUCKETS
vllm:omni_audio_frames_total Counter Cumulative audio frame count; throughput via rate()
vllm:omni_audio_underrun_s Histogram Per-request worst-case player deficit; > 0 indicates listener heard silent gaps
vllm:omni_audio_continuity_ok_total Counter threshold_ms Incremented when the request's worst underrun stayed below threshold_ms
vllm:omni_audio_skipped_requests_total Counter reason Silent-loss counter — code2wav rejected malformed codec input and returned 200 OK with empty audio

The continuity math comes from vllm_omni/benchmarks/audio_continuity.py::compute_continuity_stats so the server-side observation aligns with the bench-side definition.

Cross-Stage Transfer Metrics (vllm:omni_)

Per-physical-transfer histograms tracking the data hop between adjacent stages. Labels {model_name, from_stage, from_replica, to_stage, to_replica} let dashboards attribute latency to specific replica edges. from_replica / to_replica are resolved from the orchestrator's sticky-routing binding (stage_pool.get_bound_replica_id(request_id)), so no extra plumbing through TransferEdgeStats is needed.

Metric Type Description
vllm:omni_transfer_size_bytes Histogram Per-transfer payload size in bytes
vllm:omni_transfer_tx_s Histogram Sender-side time (serialize + submit to connector)
vllm:omni_transfer_rx_s Histogram Receiver-side time (recv + deserialize)
vllm:omni_transfer_in_flight_s Histogram Network in-flight time (TX done → RX recv start)

vLLM Engine Metrics (vllm:)

When the pipeline includes an LLM stage, the upstream vLLM engine exposes its full set of ~37 metric families under the vllm: prefix.

vLLM-Omni wraps the upstream vllm.v1.metrics.loggers.PrometheusStatLogger with OmniPrometheusStatLogger so that the original engine single label is reshaped into stage + replica. Every vllm:* family — TTFT, ITL, TPOT, e2e latency, KV cache usage, scheduler running/waiting, request success counts, etc. — therefore gains per-(stage, replica) visibility automatically. No omni-side duplicate is needed for the text path.

# Before wrap:
vllm:num_requests_running{model_name="...", engine="1"}              3.0

# After wrap:
vllm:num_requests_running{model_name="...", stage="1", replica="0"}  2.0
vllm:num_requests_running{model_name="...", stage="1", replica="1"}  1.0

For the full list of upstream metrics, see the vLLM docs.

Metric Availability by Pipeline Type

Metric group Multi-stage LLM (Qwen3-Omni)
vllm:omni_ request tracking + latency With --log-stats
vllm:omni_ audio modality With --log-stats, if pipeline has a talker stage
vllm:omni_ transfer With --log-stats, if pipeline has ≥ 2 stages
vllm: engine metrics (per (stage, replica)) With --log-stats
vllm: MFU metrics With --log-stats --enable-mfu-metrics

Naming Convention

  • All time-bearing metrics use the _s suffix (values in seconds). Buckets are SECONDS_BUCKETS for e2e / generation-style values and SECONDS_FAST_BUCKETS (1 ms → 60 s) for the fine-grained transfer and audio-underrun values.
  • Counters use the _total suffix (auto-appended by prometheus_client).
  • Sizes use the _bytes suffix.
  • All omni-specific families are prefixed vllm:omni_. The upstream unregister_vllm_metrics() function is monkey-patched (see vllm_omni/patch.py) to a scoped version that still strips upstream vllm:* collectors so multi-engine init within one process does not crash on duplicate registration, but preserves anything prefixed vllm:omni_ / vllm_omni.
  • Text and audio first-output use distinct families (vllm:time_to_first_token_seconds reused from upstream for text; vllm:omni_audio_ttfp_s for audio) rather than a single metric with a modality label.