AURA Omni: Online serving

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/aura_omni.

aura_omni serves AURA as a native multi-stage vLLM-Omni pipeline:

Qwen3-ASR -> AURA/Qwen3-VL -> Qwen3-TTS Talker -> Qwen3-TTS Code2Wav

The pipeline has three semantic modules, but four engine stages because the existing Qwen3-TTS implementation is natively split into Talker and Code2Wav.
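The module-to-stage relationship can be written down as a small table. This is an illustrative sketch only: the `Stage` dataclass and the module labels (`asr`, `aura`, `tts`) are hypothetical names chosen for this example, not identifiers from vLLM-Omni.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int   # engine stage id, in pipeline order
    name: str    # stage label as used in this document
    module: str  # semantic module the stage belongs to (illustrative label)


# Four engine stages, in the order a request flows through them.
PIPELINE = [
    Stage(0, "Qwen3-ASR", "asr"),
    Stage(1, "AURA/Qwen3-VL", "aura"),
    Stage(2, "Qwen3-TTS Talker", "tts"),
    Stage(3, "Qwen3-TTS Code2Wav", "tts"),
]

# Collapsing the stages by module (order preserved) yields the semantic modules.
modules = list(dict.fromkeys(stage.module for stage in PIPELINE))
print(len(PIPELINE), "stages ->", len(modules), "modules:", modules)
```

Both TTS stages map to the same module, which is why the stage count exceeds the module count by one.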

Start the server with the deploy profile:

vllm serve aurateam/AURA \
  --omni \
  --port 8091 \
  --deploy-config vllm_omni/deploy/aura_omni.yaml \
  --served-model-name aurateam/AURA \
  --trust-remote-code

The deploy file sets per-stage model repos:

Stage 0 ASR: Qwen/Qwen3-ASR-1.7B
Stage 1 AURA: aurateam/AURA
Stage 2/3 TTS: Qwen/Qwen3-TTS-12Hz-1.7B-Base

For local weights, edit the model value on each stage in vllm_omni/deploy/aura_omni.yaml. The deploy profile includes pipeline: aura_omni, so the server uses this four-stage topology even when the command-line model path points at one component checkpoint.
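A minimal sketch of the per-stage override, operating on an in-memory dictionary. Only the `pipeline` key and the per-stage `model` key come from the text above; the `stages` list, the `stage_id` field, and the `/models/...` paths are assumptions made for this example, so check the real layout of aura_omni.yaml before adapting it.

```python
import copy

# Hypothetical in-memory view of the deploy profile (schema is assumed).
profile = {
    "pipeline": "aura_omni",
    "stages": [
        {"stage_id": 0, "model": "Qwen/Qwen3-ASR-1.7B"},
        {"stage_id": 1, "model": "aurateam/AURA"},
        {"stage_id": 2, "model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base"},
        {"stage_id": 3, "model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base"},
    ],
}


def with_local_weights(profile: dict, local_paths: dict) -> dict:
    """Return a copy of the profile with selected stage models replaced."""
    updated = copy.deepcopy(profile)  # leave the original profile untouched
    for stage in updated["stages"]:
        if stage["stage_id"] in local_paths:
            stage["model"] = local_paths[stage["stage_id"]]
    return updated


# Placeholder paths; stages 2 and 3 share one Qwen3-TTS checkpoint.
local = with_local_weights(
    profile,
    {0: "/models/Qwen3-ASR-1.7B", 2: "/models/Qwen3-TTS", 3: "/models/Qwen3-TTS"},
)
for stage in local["stages"]:
    print(stage["stage_id"], stage["model"])
```

Stage 1 keeps its repo id here because no local path was supplied for it.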

Expected request shape:

Send microphone audio as the Stage 0 multimodal audio input.
Include video frames in the original request multi_modal_data; the asr2aura processor carries them forward to AURA.
Optional additional_information keys:
  aura_system_prompt
  tts_task_type
  tts_language
  tts_speaker
  tts_instruct
  tts_ref_audio
  tts_ref_text
  tts_x_vector_only_mode
  tts_pass_token_ids
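The request shape above can be assembled as a plain JSON body. The sketch below mirrors the field layout used by the example client and curl script on this page; the `build_request` helper and the example.com URLs are hypothetical, and dropping unset keys so that server defaults apply is an assumption.

```python
import json


def build_request(audio_url: str, video_url: str, prompt: str, **options) -> dict:
    """Assemble a chat-completions body in the shape described above."""
    # Keep only the options that were actually set (assumed to be safe,
    # since every additional_information key is optional).
    additional_information = {k: v for k, v in options.items() if v is not None}
    return {
        "model": "aurateam/AURA",
        "modalities": ["text", "audio"],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "additional_information": additional_information,
    }


body = build_request(
    "https://example.com/input.wav",  # placeholder media URLs
    "https://example.com/clip.mp4",
    "Respond briefly in English.",
    tts_task_type="Base",
    tts_language="English",
    tts_speaker=None,  # unset, so it is omitted from the body
)
print(json.dumps(body["additional_information"], indent=2))
```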

If AURA emits <|silent|>, the aura2tts processor returns no TTS request, so the TTS stages are skipped for that turn.
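The silent-turn behavior amounts to an early return. This is a sketch of the idea, not the aura2tts implementation: the function name `to_tts_request`, the returned dictionary shape, and the exact-match comparison after stripping whitespace are all assumptions.

```python
from typing import Optional

SILENT_TOKEN = "<|silent|>"


def to_tts_request(aura_text: str) -> Optional[dict]:
    """Return a TTS request for a spoken reply, or None for a silent turn."""
    text = aura_text.strip()
    if text == SILENT_TOKEN:
        return None  # no request is produced, so the TTS stages have nothing to run
    return {"text": text}


print(to_tts_request("<|silent|>"))
print(to_tts_request("Watch out, the pan is hot."))
```

A client should therefore expect some turns to come back without an audio payload.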

GPU Utilization Recommendation

Tune gpu_memory_utilization per stage in vllm_omni/deploy/aura_omni.yaml. Recommended baseline for a single H200 GPU:

Stage 0 (ASR): 0.10
Stage 1 (AURA): 0.40
Stage 2 (Qwen3-TTS Talker): 0.20
Stage 3 (Qwen3-TTS Code2Wav): 0.20
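A quick arithmetic check of the baseline, assuming all four stages share one GPU so that their fractions have to fit within the device together. The stage labels are copied from the list above; the check itself is an illustration, not part of vLLM-Omni.

```python
import math

# Per-stage gpu_memory_utilization values from the baseline above.
baseline = {
    "Stage 0 (ASR)": 0.10,
    "Stage 1 (AURA)": 0.40,
    "Stage 2 (Qwen3-TTS Talker)": 0.20,
    "Stage 3 (Qwen3-TTS Code2Wav)": 0.20,
}

total = math.fsum(baseline.values())  # fsum avoids float accumulation error
headroom = 1.0 - total
print(f"total={total:.2f} headroom={headroom:.2f}")

# With every stage on the same device, the fractions must not exceed 1.0.
assert total <= 1.0, "per-stage fractions exceed a single GPU"
```

The baseline sums to 0.90, leaving roughly 10% of device memory unclaimed by the stages. When raising one stage's share, lower another so the total stays below 1.0.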

Python Client

python examples/online_serving/aura_omni/openai_chat_completion_client.py \
  --host localhost \
  --port 8091 \
  --model aurateam/AURA \
  --modalities text,audio

Use local media:

python examples/online_serving/aura_omni/openai_chat_completion_client.py \
  --audio-path /path/to/input.wav \
  --video-path /path/to/video.mp4 \
  --output-dir output_aura_omni_online

Base voice clone mode (the default; add --tts-x-vector-only-mode to fall back to speaker-embedding-only conditioning while debugging ICL):

python examples/online_serving/aura_omni/openai_chat_completion_client.py \
  --tts-task-type Base \
  --tts-ref-audio vllm-omni/tests/assets/qwen3_tts/clone_2.wav \
  --tts-ref-text "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

Enable AURA token-id passthrough explicitly:

python examples/online_serving/aura_omni/openai_chat_completion_client.py \
  --tts-pass-token-ids

CustomVoice mode requires stages 2 and 3 in aura_omni.yaml to point at a Qwen3-TTS CustomVoice checkpoint:

python examples/online_serving/aura_omni/openai_chat_completion_client.py \
  --tts-task-type CustomVoice \
  --tts-speaker Vivian

By default, AURA responses are passed to Qwen3-TTS as text. Set tts_pass_token_ids=true to pass AURA-generated assistant token ids directly to Qwen3-TTS instead. The processor still uses AURA token ids, when available, to estimate the Talker prompt length in the default text path.
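The two paths can be sketched as a single selection function. Everything here is hypothetical: `build_talker_input`, the `mode` and `length_hint` fields, and the fallback to text when no token ids are available are assumptions used to illustrate the behavior described above, not the processor's real interface.

```python
from typing import Optional, Sequence


def build_talker_input(
    text: str,
    token_ids: Optional[Sequence[int]],
    pass_token_ids: bool = False,
) -> dict:
    """Illustrative choice between the text path and the token-id path."""
    if pass_token_ids and token_ids:
        # Token-id path: forward AURA's generated ids unchanged.
        return {"mode": "token_ids", "token_ids": list(token_ids)}
    # Default text path: send text, but reuse the ids (if any) as a length hint.
    length_hint = len(token_ids) if token_ids else None
    return {"mode": "text", "text": text, "length_hint": length_hint}


print(build_talker_input("Hello there.", [101, 102, 103]))
print(build_talker_input("Hello there.", [101, 102, 103], pass_token_ids=True))
```

The point of the sketch is that token ids are useful in both modes: as the payload when passthrough is enabled, and as a length estimate otherwise.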

Curl

cd examples/online_serving/aura_omni
bash run_curl_multimodal_generation.sh

Set PORT, MODEL, OUTPUT_DIR, or TTS_PASS_TOKEN_IDS to override the defaults:

PORT=8666 MODEL=aurateam/AURA bash run_curl_multimodal_generation.sh
TTS_PASS_TOKEN_IDS=true PORT=8666 MODEL=aurateam/AURA bash run_curl_multimodal_generation.sh

Gradio

Launch the server and Gradio UI together:

cd examples/online_serving/aura_omni
bash run_gradio_demo.sh

If the server is already running:

python examples/online_serving/aura_omni/gradio_demo.py \
  --model aurateam/AURA \
  --api-base http://localhost:8091/v1
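Before pointing the UI at an existing server, it can help to confirm that the API base answers. The sketch below polls the `/models` route under the API base, assuming the server exposes the usual OpenAI-compatible model listing there; `server_is_ready` and `wait_for_server` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(api_base: str, timeout: float = 2.0) -> bool:
    """Return True if GET {api_base}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{api_base}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_server(api_base: str, attempts: int = 60, delay: float = 1.0) -> bool:
    """Poll until the server answers or the attempts run out."""
    for _ in range(attempts):
        if server_is_ready(api_base):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_server("http://localhost:8091/v1")` returns `True` once the route responds and `False` if it never does within the allotted attempts.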

Offline

For offline inference, see examples/offline_inference/aura_omni.

Example materials

gradio_demo.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/aura_omni/gradio_demo.py.

openai_chat_completion_client.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for the AURA Omni pipeline."""

from __future__ import annotations

import base64
import io
import os

import soundfile as sf
from openai import OpenAI
from vllm.assets.audio import AudioAsset
from vllm.utils.argparse_utils import FlexibleArgumentParser

from vllm_omni.model_executor.stage_input_processors.aura_omni import (
    DEFAULT_QWEN3_TTS_REF_TEXT,
    default_qwen3_tts_ref_audio_path,
)

SEED = 42
DEFAULT_MODEL = "aurateam/AURA"
DEFAULT_VIDEO_URL = "https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"


def _encode_file(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def _data_url(path: str, default_mime: str) -> str:
    suffix = os.path.splitext(path)[1].lower()
    mime_by_suffix = {
        ".wav": "audio/wav",
        ".mp3": "audio/mpeg",
        ".ogg": "audio/ogg",
        ".flac": "audio/flac",
        ".m4a": "audio/mp4",
        ".mp4": "video/mp4",
        ".webm": "video/webm",
        ".mov": "video/quicktime",
        ".avi": "video/x-msvideo",
        ".mkv": "video/x-matroska",
    }
    return f"data:{mime_by_suffix.get(suffix, default_mime)};base64,{_encode_file(path)}"


def media_url(path_or_url: str | None, *, kind: str) -> str:
    if path_or_url:
        if path_or_url.startswith(("http://", "https://", "data:")):
            return path_or_url
        if not os.path.exists(path_or_url):
            raise FileNotFoundError(f"{kind} file not found: {path_or_url}")
        return _data_url(path_or_url, "audio/wav" if kind == "audio" else "video/mp4")
    if kind == "audio":
        return AudioAsset("mary_had_lamb").url
    return DEFAULT_VIDEO_URL


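def sampling_params_list() -> list[dict]:
    # One entry per engine stage, in pipeline order:
    # ASR, AURA, Qwen3-TTS Talker, Qwen3-TTS Code2Wav.
    return [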
        {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 256, "seed": SEED},
        {
            "temperature": 0.5,
            "top_p": 1.0,
            "top_k": -1,
            "max_tokens": 256,
            "seed": SEED,
            "repetition_penalty": 1.0,
        },
        {
            "temperature": 0.9,
            "top_k": 50,
            "max_tokens": 4096,
            "seed": SEED,
            "detokenize": False,
            "repetition_penalty": 1.05,
            "stop_token_ids": [2150],
        },
        {
            "temperature": 0.0,
            "top_p": 1.0,
            "top_k": -1,
            "max_tokens": 65536,
            "seed": SEED,
            "repetition_penalty": 1.0,
        },
    ]


def parse_modalities(value: str | None) -> list[str] | None:
    if not value:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


def save_response(response, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    for idx, choice in enumerate(response.choices):
        message = choice.message
        if message.content:
            out_txt = os.path.join(output_dir, f"choice_{idx}.txt")
            with open(out_txt, "w", encoding="utf-8") as f:
                f.write(str(message.content).strip() + "\n")
            print(f"Text saved to {out_txt}")
            print(message.content)
        if getattr(message, "audio", None):
            audio_bytes = base64.b64decode(message.audio.data)
            audio_np, sample_rate = sf.read(io.BytesIO(audio_bytes))
            out_wav = os.path.join(output_dir, f"choice_{idx}.wav")
            sf.write(out_wav, audio_np, int(sample_rate), format="WAV")
            print(f"Audio saved to {out_wav}")


def main(args) -> None:
    client = OpenAI(base_url=f"http://{args.host}:{args.port}/v1", api_key="EMPTY")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": media_url(args.audio_path, kind="audio")}},
                {"type": "video_url", "video_url": {"url": media_url(args.video_path, kind="video")}},
                {"type": "text", "text": args.prompt},
            ],
        }
    ]
    response = client.chat.completions.create(
        model=args.model,
        messages=messages,
        modalities=parse_modalities(args.modalities),
        extra_body={
            "sampling_params_list": sampling_params_list(),
            "additional_information": {
                "aura_system_prompt": args.aura_system_prompt,
                "tts_task_type": args.tts_task_type,
                "tts_language": args.tts_language,
                "tts_speaker": args.tts_speaker,
                "tts_instruct": args.tts_instruct,
                "tts_ref_audio": args.tts_ref_audio,
                "tts_ref_text": args.tts_ref_text,
                "tts_x_vector_only_mode": args.tts_x_vector_only_mode,
                "tts_pass_token_ids": args.tts_pass_token_ids,
            },
        },
        timeout=args.timeout,
    )
    save_response(response, args.output_dir)


def parse_args():
    parser = FlexibleArgumentParser(description="AURA Omni online serving client")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8091)
    parser.add_argument("--model", default=DEFAULT_MODEL)
    parser.add_argument("--audio-path", default=None, help="Audio file, URL, or data URL.")
    parser.add_argument("--video-path", default=None, help="Video file, URL, or data URL.")
    parser.add_argument(
        "--prompt",
        default="Use the audio and video together to decide whether a reply is needed. If needed, respond briefly in English.",
    )
    parser.add_argument("--modalities", default="text,audio")
    parser.add_argument("--output-dir", default="output_aura_omni_online")
    parser.add_argument(
        "--aura-system-prompt",
        default=(
            "You are receiving a live video stream where the final frame is the present moment. "
            "Respond only when a response is needed. Otherwise output '<|silent|>'. Respond in English."
        ),
    )
    parser.add_argument("--tts-task-type", default="Base", choices=["Base", "CustomVoice"])
    parser.add_argument("--tts-language", default="English")
    parser.add_argument("--tts-speaker", default="Vivian")
    parser.add_argument("--tts-instruct", default="")
    parser.add_argument(
        "--tts-ref-audio",
        default=default_qwen3_tts_ref_audio_path(),
        help="Base-mode reference audio path/URL visible to server.",
    )
    parser.add_argument(
        "--tts-ref-text",
        default=DEFAULT_QWEN3_TTS_REF_TEXT,
        help="Base-mode reference audio transcript.",
    )
    parser.add_argument(
        "--tts-x-vector-only-mode",
        action="store_true",
        help="Use speaker embedding only for Base mode (disable ICL ref_text conditioning).",
    )
    parser.add_argument(
        "--tts-pass-token-ids",
        action="store_true",
        help="Pass AURA-generated assistant token ids directly to Qwen3-TTS. Defaults to sending text.",
    )
    parser.add_argument("--timeout", type=float, default=600.0)
    return parser.parse_args()


if __name__ == "__main__":
    main(parse_args())

run_curl_multimodal_generation.sh

#!/usr/bin/env bash
set -euo pipefail

PORT="${PORT:-8091}"
MODEL="${MODEL:-aurateam/AURA}"
OUTPUT_DIR="${OUTPUT_DIR:-output_aura_omni_online}"
TTS_PASS_TOKEN_IDS="${TTS_PASS_TOKEN_IDS:-false}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VLLM_OMNI_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
CLONE_REF_AUDIO="${VLLM_OMNI_ROOT}/tests/assets/qwen3_tts/clone_2.wav"
CLONE_REF_TEXT="Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
mkdir -p "$OUTPUT_DIR"

MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"

request_body=$(cat <<EOF
{
  "model": "$MODEL",
  "modalities": ["text", "audio"],
  "sampling_params_list": [
    {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 256, "seed": 42},
    {"temperature": 0.5, "top_p": 1.0, "top_k": -1, "max_tokens": 256, "seed": 42, "repetition_penalty": 1.0},
    {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
    {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "repetition_penalty": 1.0}
  ],
  "additional_information": {
    "aura_system_prompt": "You are receiving a live video stream where the final frame is the present moment. Respond only when a response is needed. Otherwise output '<|silent|>'. Respond in English.",
    "tts_task_type": "Base",
    "tts_ref_audio": "file://${CLONE_REF_AUDIO}",
    "tts_ref_text": "${CLONE_REF_TEXT}",
    "tts_language": "English",
    "tts_speaker": "Vivian",
    "tts_instruct": "",
    "tts_pass_token_ids": ${TTS_PASS_TOKEN_IDS}
  },
  "messages": [{
    "role": "user",
    "content": [
      {"type": "audio_url", "audio_url": {"url": "$MARY_HAD_LAMB_AUDIO_URL"}},
      {"type": "video_url", "video_url": {"url": "$SAMPLE_VIDEO_URL"}},
      {"type": "text", "text": "Use the audio and video together to decide whether a reply is needed. If needed, respond briefly in English."}
    ]
  }]
}
EOF
)

response=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
  -X POST "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$request_body")

echo "$response" | jq '.choices[].message.content'

audio_b64=$(echo "$response" | jq -r '.choices[]?.message.audio.data // empty' | head -n 1)
if [[ -n "$audio_b64" ]]; then
  echo "$audio_b64" | base64 -d > "${OUTPUT_DIR}/aura_omni_output.wav"
  echo "Audio saved to ${OUTPUT_DIR}/aura_omni_output.wav"
fi

run_gradio_demo.sh

#!/usr/bin/env bash
set -euo pipefail

MODEL="aurateam/AURA"
SERVER_MODEL="aurateam/AURA"
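# Resolve the deploy profile relative to this script rather than a machine-specific absolute path.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DEPLOY_CONFIG="${SCRIPT_DIR}/../../../vllm_omni/deploy/aura_omni.yaml"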
SERVER_PORT=8091
GRADIO_PORT=7862
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false


while [[ $# -gt 0 ]]; do
  case "$1" in
    --model) MODEL="$2"; shift 2 ;;
    --server-model) SERVER_MODEL="$2"; shift 2 ;;
    --deploy-config) DEPLOY_CONFIG="$2"; shift 2 ;;
    --server-port) SERVER_PORT="$2"; shift 2 ;;
    --gradio-port) GRADIO_PORT="$2"; shift 2 ;;
    --server-host) SERVER_HOST="$2"; shift 2 ;;
    --gradio-ip) GRADIO_IP="$2"; shift 2 ;;
    --share) GRADIO_SHARE=true; shift ;;
    --help)
      echo "Usage: $0 [--model SERVED_MODEL_NAME] [--server-model MODEL_PATH] [--deploy-config YAML] [--server-port PORT] [--gradio-port PORT] [--server-host HOST] [--gradio-ip IP] [--share]"
      exit 0
      ;;
    *) echo "Unknown option: $1"; exit 1 ;;
  esac
done

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}/v1"
LOG_FILE="/tmp/aura_omni_vllm_${SERVER_PORT}.log"

cleanup() {
  echo "Shutting down..."
  [[ -n "${SERVER_PID:-}" ]] && kill "$SERVER_PID" 2>/dev/null || true
  [[ -n "${GRADIO_PID:-}" ]] && kill "$GRADIO_PID" 2>/dev/null || true
}
trap cleanup SIGINT SIGTERM EXIT

vllm serve "$SERVER_MODEL" \
  --omni \
  --host "$SERVER_HOST" \
  --port "$SERVER_PORT" \
  --deploy-config "$DEPLOY_CONFIG" \
  --served-model-name "$MODEL" \
  --trust-remote-code 2>&1 | tee "$LOG_FILE" &
SERVER_PID=$!

echo "Waiting for server startup..."
for _ in $(seq 1 600); do
  if grep -q "Application startup complete" "$LOG_FILE" 2>/dev/null; then
    break
  fi
  if ! kill -0 "$SERVER_PID" 2>/dev/null; then
    echo "vLLM server exited before startup completed"
    wait "$SERVER_PID" || true
    exit 1
  fi
  sleep 1
done

# Launch the demo by its full path so the script works from any working directory.
GRADIO_CMD=(python "${SCRIPT_DIR}/gradio_demo.py" --model "$MODEL" --api-base "$API_BASE" --ip "$GRADIO_IP" --port "$GRADIO_PORT")
if [[ "$GRADIO_SHARE" == "true" ]]; then
  GRADIO_CMD+=(--share)
fi
"${GRADIO_CMD[@]}" &
GRADIO_PID=$!

echo "vLLM server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio demo: http://${GRADIO_IP}:${GRADIO_PORT}"
if [[ -n "${SERVER_PID:-}" ]]; then
  wait "$SERVER_PID" "$GRADIO_PID"
else
  wait "$GRADIO_PID"
fi