AURA Omni Native Pipeline
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/aura_omni.
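aura_omni serves AURA as a native multi-stage vLLM-Omni pipeline: ASR (Stage 0), then AURA (Stage 1), then Qwen3-TTS Talker (Stage 2) and Qwen3-TTS Code2Wav (Stage 3).
The pipeline has three semantic modules (ASR, AURA, and TTS) but four engine stages, because the existing Qwen3-TTS implementation is natively split into Talker and Code2Wav.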
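The relationship between modules and stages can be written down as a small table. The sketch below is illustrative only: the `Stage` class and the `STAGES` list are hypothetical names for this page and do not reflect the schema of the deploy YAML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int   # engine stage number
    module: str  # semantic module the stage belongs to
    engine: str  # engine stage name as used on this page


# Hypothetical description of the topology; not read from aura_omni.yaml.
STAGES = [
    Stage(0, "ASR", "ASR"),
    Stage(1, "AURA", "AURA"),
    Stage(2, "TTS", "Qwen3-TTS Talker"),
    Stage(3, "TTS", "Qwen3-TTS Code2Wav"),
]

# Distinct semantic modules, in first-seen order.
modules = list(dict.fromkeys(stage.module for stage in STAGES))
print(f"{len(modules)} modules, {len(STAGES)} engine stages")
```

The TTS module is the only one that maps to two engine stages, which is why the counts differ.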
Start the server with the deploy profile:
vllm serve aurateam/AURA \
--omni \
--port 8091 \
--deploy-config vllm_omni/deploy/aura_omni.yaml \
--served-model-name aurateam/AURA \
--trust-remote-code
The deploy file sets per-stage model repos:
- Stage 0 ASR: Qwen/Qwen3-ASR-1.7B
- Stage 1 AURA: aurateam/AURA
- Stage 2/3 TTS: Qwen/Qwen3-TTS-12Hz-1.7B-Base
For local weights, edit the model value on each stage in vllm_omni/deploy/aura_omni.yaml. The deploy profile includes pipeline: aura_omni, so the server uses this four-stage topology even when the command-line model path points at one component checkpoint.
Expected request shape:
- Send microphone audio as the Stage 0 multimodal audio input.
- Include video frames in the original request multi_modal_data; the asr2aura processor carries them forward to AURA.
- Optional additional_information keys:
  - aura_system_prompt
  - tts_task_type
  - tts_language
  - tts_speaker
  - tts_instruct
  - tts_ref_audio
  - tts_ref_text
  - tts_x_vector_only_mode
  - tts_pass_token_ids
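As a rough illustration of this request shape, the sketch below assembles a chat-completions body the way the example client and curl script on this page do. The `build_request_body` helper and the example.com URLs are placeholders introduced here, and the `sampling_params_list` field used by the real examples is left out for brevity.

```python
from __future__ import annotations

import json

# Placeholder media URLs for illustration only; substitute real inputs.
AUDIO_URL = "https://example.com/input.wav"
VIDEO_URL = "https://example.com/input.mp4"


def build_request_body(prompt: str, **additional_information: object) -> dict:
    """Assemble a /v1/chat/completions body in the shape this example uses."""
    return {
        "model": "aurateam/AURA",
        "modalities": ["text", "audio"],
        # Optional per-request keys such as tts_task_type or tts_language.
        "additional_information": dict(additional_information),
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": AUDIO_URL}},
                    {"type": "video_url", "video_url": {"url": VIDEO_URL}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


body = build_request_body(
    "If a reply is needed, respond briefly in English.",
    tts_task_type="Base",
    tts_language="English",
    tts_pass_token_ids=False,
)
print(json.dumps(body, indent=2))
```

With the OpenAI Python client, the non-standard fields go through `extra_body`, as openai_chat_completion_client.py does.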
If AURA emits <|silent|>, the aura2tts processor returns no TTS request, so the TTS stages are skipped for that turn.
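The skip behaviour can be pictured as a small gate in front of the TTS stages. This is a minimal sketch, not the real aura2tts processor: the `build_tts_request` function is hypothetical, and treating any output that contains the marker as silent is an assumption.

```python
from __future__ import annotations

SILENT_TOKEN = "<|silent|>"


def build_tts_request(aura_text: str) -> dict | None:
    """Hypothetical stand-in for the aura2tts decision.

    Returns None when AURA stays silent, so no TTS work is scheduled.
    """
    text = aura_text.strip()
    if SILENT_TOKEN in text:  # assumption: substring match on the marker
        return None
    return {"text": text}


print(build_tts_request("<|silent|>"))
print(build_tts_request("The kettle is boiling."))
```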
GPU Utilization Recommendation
Tune gpu_memory_utilization per stage in vllm_omni/deploy/aura_omni.yaml. Recommended baseline for running all four stages on a single H200 GPU:
- Stage 0 (ASR): 0.10
- Stage 1 (AURA): 0.4
- Stage 2 (Qwen3-TTS Talker): 0.20
- Stage 3 (Qwen3-TTS Code2Wav): 0.20
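When all stages share one GPU, the per-stage fractions have to fit within the device together. The sketch below simply adds up the baseline values above; the dictionary keys are illustrative labels, not keys from the deploy file.

```python
# Baseline fractions from this page; the key names are illustrative labels.
GPU_MEMORY_UTILIZATION = {
    "stage0_asr": 0.10,
    "stage1_aura": 0.40,
    "stage2_tts_talker": 0.20,
    "stage3_tts_code2wav": 0.20,
}


def total_utilization(fractions: dict) -> float:
    """Sum the per-stage fractions, rounded to avoid float noise."""
    return round(sum(fractions.values()), 4)


total = total_utilization(GPU_MEMORY_UTILIZATION)
headroom = round(1.0 - total, 4)
print(f"total={total:.2f} headroom={headroom:.2f}")
```

The baseline sums to 0.90, leaving about a tenth of the device unreserved. If one stage is given more memory, another stage needs less to stay under 1.0.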
Python Client
python examples/online_serving/aura_omni/openai_chat_completion_client.py \
--host localhost \
--port 8091 \
--model aurateam/AURA \
--modalities text,audio
Use local media:
python examples/online_serving/aura_omni/openai_chat_completion_client.py \
--audio-path /path/to/input.wav \
--video-path /path/to/video.mp4 \
--output-dir output_aura_omni_online
Base voice-clone mode (the default task type; x-vector-only mode, enabled with --tts-x-vector-only-mode, is recommended while debugging ICL):
python examples/online_serving/aura_omni/openai_chat_completion_client.py \
--tts-task-type Base \
--tts-ref-audio vllm-omni/tests/assets/qwen3_tts/clone_2.wav \
--tts-ref-text "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
To enable AURA token-id passthrough explicitly, add --tts-pass-token-ids to the client command.
CustomVoice mode requires stages 2 and 3 in aura_omni.yaml to point at a Qwen3-TTS CustomVoice checkpoint:
python examples/online_serving/aura_omni/openai_chat_completion_client.py \
--tts-task-type CustomVoice \
--tts-speaker Vivian
By default, AURA responses are passed to Qwen3-TTS as text. Set tts_pass_token_ids=true to pass AURA-generated assistant token ids directly to Qwen3-TTS instead. The processor still uses AURA token ids, when available, to estimate the Talker prompt length in the default text path.
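The choice between the two paths can be sketched as a single branch. The `select_tts_input` function is hypothetical and is not the processor's code; falling back to text when no token ids are available is an assumption, and the Talker prompt-length estimate is not modelled here.

```python
from __future__ import annotations

from typing import Sequence


def select_tts_input(
    text: str,
    token_ids: Sequence[int] | None,
    pass_token_ids: bool = False,
) -> dict:
    """Pick what is handed to Qwen3-TTS for one turn (illustrative only)."""
    if pass_token_ids and token_ids:
        # Passthrough path: forward AURA-generated assistant token ids.
        return {"kind": "token_ids", "value": list(token_ids)}
    # Default path, also used here when no token ids exist (assumption).
    return {"kind": "text", "value": text}


print(select_tts_input("Hello there.", [101, 102, 103]))
print(select_tts_input("Hello there.", [101, 102, 103], pass_token_ids=True))
```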
Curl
Set PORT, MODEL, OUTPUT_DIR, or TTS_PASS_TOKEN_IDS to override the script defaults:
PORT=8666 MODEL=aurateam/AURA bash run_curl_multimodal_generation.sh
TTS_PASS_TOKEN_IDS=true PORT=8666 MODEL=aurateam/AURA bash run_curl_multimodal_generation.sh
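The script reads the first non-empty choices[].message.audio.data field from the response and base64-decodes it into a WAV file. A Python equivalent of that extraction step might look like the sketch below; the `extract_first_audio` helper and the fake response are illustrative, not part of the example.

```python
from __future__ import annotations

import base64
import json


def extract_first_audio(response_json: str) -> bytes | None:
    """Return the decoded audio bytes of the first choice that has audio."""
    response = json.loads(response_json)
    for choice in response.get("choices", []):
        audio = (choice.get("message") or {}).get("audio") or {}
        data = audio.get("data")
        if data:
            return base64.b64decode(data)
    return None  # e.g. a silent turn with no audio output


# Fake response for illustration; a real one comes from /v1/chat/completions.
fake = json.dumps(
    {
        "choices": [
            {"message": {"content": "Hello."}},
            {"message": {"audio": {"data": base64.b64encode(b"RIFF").decode()}}},
        ]
    }
)
print(extract_first_audio(fake))
```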
Gradio
To launch the server and Gradio UI together, run run_gradio_demo.sh (listed under Example materials).
If the server is already running:
python examples/online_serving/aura_omni/gradio_demo.py \
--model aurateam/AURA \
--api-base http://localhost:8091/v1
Offline
For offline inference, see examples/offline_inference/aura_omni.
Example materials
gradio_demo.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/aura_omni/gradio_demo.py.
openai_chat_completion_client.py
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""OpenAI-compatible client for the AURA Omni pipeline."""
from __future__ import annotations
import base64
import io
import os
import soundfile as sf
from openai import OpenAI
from vllm.assets.audio import AudioAsset
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm_omni.model_executor.stage_input_processors.aura_omni import (
    DEFAULT_QWEN3_TTS_REF_TEXT,
    default_qwen3_tts_ref_audio_path,
)
SEED = 42
DEFAULT_MODEL = "aurateam/AURA"
DEFAULT_VIDEO_URL = "https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"
def _encode_file(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def _data_url(path: str, default_mime: str) -> str:
    suffix = os.path.splitext(path)[1].lower()
    mime_by_suffix = {
        ".wav": "audio/wav",
        ".mp3": "audio/mpeg",
        ".ogg": "audio/ogg",
        ".flac": "audio/flac",
        ".m4a": "audio/mp4",
        ".mp4": "video/mp4",
        ".webm": "video/webm",
        ".mov": "video/quicktime",
        ".avi": "video/x-msvideo",
        ".mkv": "video/x-matroska",
    }
    return f"data:{mime_by_suffix.get(suffix, default_mime)};base64,{_encode_file(path)}"


def media_url(path_or_url: str | None, *, kind: str) -> str:
    if path_or_url:
        if path_or_url.startswith(("http://", "https://", "data:")):
            return path_or_url
        if not os.path.exists(path_or_url):
            raise FileNotFoundError(f"{kind} file not found: {path_or_url}")
        return _data_url(path_or_url, "audio/wav" if kind == "audio" else "video/mp4")
    if kind == "audio":
        return AudioAsset("mary_had_lamb").url
    return DEFAULT_VIDEO_URL
def sampling_params_list() -> list[dict]:
    return [
        {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 256, "seed": SEED},
        {
            "temperature": 0.5,
            "top_p": 1.0,
            "top_k": -1,
            "max_tokens": 256,
            "seed": SEED,
            "repetition_penalty": 1.0,
        },
        {
            "temperature": 0.9,
            "top_k": 50,
            "max_tokens": 4096,
            "seed": SEED,
            "detokenize": False,
            "repetition_penalty": 1.05,
            "stop_token_ids": [2150],
        },
        {
            "temperature": 0.0,
            "top_p": 1.0,
            "top_k": -1,
            "max_tokens": 65536,
            "seed": SEED,
            "repetition_penalty": 1.0,
        },
    ]


def parse_modalities(value: str | None) -> list[str] | None:
    if not value:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]
def save_response(response, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    for idx, choice in enumerate(response.choices):
        message = choice.message
        if message.content:
            out_txt = os.path.join(output_dir, f"choice_{idx}.txt")
            with open(out_txt, "w", encoding="utf-8") as f:
                f.write(str(message.content).strip() + "\n")
            print(f"Text saved to {out_txt}")
            print(message.content)
        if getattr(message, "audio", None):
            audio_bytes = base64.b64decode(message.audio.data)
            audio_np, sample_rate = sf.read(io.BytesIO(audio_bytes))
            out_wav = os.path.join(output_dir, f"choice_{idx}.wav")
            sf.write(out_wav, audio_np, int(sample_rate), format="WAV")
            print(f"Audio saved to {out_wav}")
def main(args) -> None:
    client = OpenAI(base_url=f"http://{args.host}:{args.port}/v1", api_key="EMPTY")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": media_url(args.audio_path, kind="audio")}},
                {"type": "video_url", "video_url": {"url": media_url(args.video_path, kind="video")}},
                {"type": "text", "text": args.prompt},
            ],
        }
    ]
    response = client.chat.completions.create(
        model=args.model,
        messages=messages,
        modalities=parse_modalities(args.modalities),
        extra_body={
            "sampling_params_list": sampling_params_list(),
            "additional_information": {
                "aura_system_prompt": args.aura_system_prompt,
                "tts_task_type": args.tts_task_type,
                "tts_language": args.tts_language,
                "tts_speaker": args.tts_speaker,
                "tts_instruct": args.tts_instruct,
                "tts_ref_audio": args.tts_ref_audio,
                "tts_ref_text": args.tts_ref_text,
                "tts_x_vector_only_mode": args.tts_x_vector_only_mode,
                "tts_pass_token_ids": args.tts_pass_token_ids,
            },
        },
        timeout=args.timeout,
    )
    save_response(response, args.output_dir)
def parse_args():
    parser = FlexibleArgumentParser(description="AURA Omni online serving client")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8091)
    parser.add_argument("--model", default=DEFAULT_MODEL)
    parser.add_argument("--audio-path", default=None, help="Audio file, URL, or data URL.")
    parser.add_argument("--video-path", default=None, help="Video file, URL, or data URL.")
    parser.add_argument(
        "--prompt",
        default="Use the audio and video together to decide whether a reply is needed. If needed, respond briefly in English.",
    )
    parser.add_argument("--modalities", default="text,audio")
    parser.add_argument("--output-dir", default="output_aura_omni_online")
    parser.add_argument(
        "--aura-system-prompt",
        default=(
            "You are receiving a live video stream where the final frame is the present moment. "
            "Respond only when a response is needed. Otherwise output '<|silent|>'. Respond in English."
        ),
    )
    parser.add_argument("--tts-task-type", default="Base", choices=["Base", "CustomVoice"])
    parser.add_argument("--tts-language", default="English")
    parser.add_argument("--tts-speaker", default="Vivian")
    parser.add_argument("--tts-instruct", default="")
    parser.add_argument(
        "--tts-ref-audio",
        default=default_qwen3_tts_ref_audio_path(),
        help="Base-mode reference audio path/URL visible to server.",
    )
    parser.add_argument(
        "--tts-ref-text",
        default=DEFAULT_QWEN3_TTS_REF_TEXT,
        help="Base-mode reference audio transcript.",
    )
    parser.add_argument(
        "--tts-x-vector-only-mode",
        action="store_true",
        help="Use speaker embedding only for Base mode (disable ICL ref_text conditioning).",
    )
    parser.add_argument(
        "--tts-pass-token-ids",
        action="store_true",
        help="Pass AURA-generated assistant token ids directly to Qwen3-TTS. Defaults to sending text.",
    )
    parser.add_argument("--timeout", type=float, default=600.0)
    return parser.parse_args()


if __name__ == "__main__":
    main(parse_args())
run_curl_multimodal_generation.sh
#!/usr/bin/env bash
set -euo pipefail
PORT="${PORT:-8091}"
MODEL="${MODEL:-aurateam/AURA}"
OUTPUT_DIR="${OUTPUT_DIR:-output_aura_omni_online}"
TTS_PASS_TOKEN_IDS="${TTS_PASS_TOKEN_IDS:-false}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VLLM_OMNI_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
CLONE_REF_AUDIO="${VLLM_OMNI_ROOT}/tests/assets/qwen3_tts/clone_2.wav"
CLONE_REF_TEXT="Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
mkdir -p "$OUTPUT_DIR"
MARY_HAD_LAMB_AUDIO_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
SAMPLE_VIDEO_URL="https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4"
request_body=$(cat <<EOF
{
"model": "$MODEL",
"modalities": ["text", "audio"],
"sampling_params_list": [
{"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 256, "seed": 42},
{"temperature": 0.5, "top_p": 1.0, "top_k": -1, "max_tokens": 256, "seed": 42, "repetition_penalty": 1.0},
{"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
{"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "repetition_penalty": 1.0}
],
"additional_information": {
"aura_system_prompt": "You are receiving a live video stream where the final frame is the present moment. Respond only when a response is needed. Otherwise output '<|silent|>'. Respond in English.",
"tts_task_type": "Base",
"tts_ref_audio": "file://${CLONE_REF_AUDIO}",
"tts_ref_text": "${CLONE_REF_TEXT}",
"tts_language": "English",
"tts_speaker": "Vivian",
"tts_instruct": "",
"tts_pass_token_ids": ${TTS_PASS_TOKEN_IDS}
},
"messages": [{
"role": "user",
"content": [
{"type": "audio_url", "audio_url": {"url": "$MARY_HAD_LAMB_AUDIO_URL"}},
{"type": "video_url", "video_url": {"url": "$SAMPLE_VIDEO_URL"}},
{"type": "text", "text": "Use the audio and video together to decide whether a reply is needed. If needed, respond briefly in English."}
]
}]
}
EOF
)
response=$(curl -sS --retry 3 --retry-delay 3 --retry-connrefused \
-X POST "http://localhost:${PORT}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "$request_body")
echo "$response" | jq '.choices[].message.content'
audio_b64=$(echo "$response" | jq -r '.choices[]?.message.audio.data // empty' | head -n 1)
if [[ -n "$audio_b64" ]]; then
    echo "$audio_b64" | base64 -d > "${OUTPUT_DIR}/aura_omni_output.wav"
    echo "Audio saved to ${OUTPUT_DIR}/aura_omni_output.wav"
fi
run_gradio_demo.sh
#!/usr/bin/env bash
set -euo pipefail
MODEL="aurateam/AURA"
SERVER_MODEL="aurateam/AURA"
DEPLOY_CONFIG="/data/yrr/vllm-omni/vllm_omni/deploy/aura_omni.yaml"
SERVER_PORT=8091
GRADIO_PORT=7862
SERVER_HOST="0.0.0.0"
GRADIO_IP="127.0.0.1"
GRADIO_SHARE=false
while [[ $# -gt 0 ]]; do
    case "$1" in
        --model) MODEL="$2"; shift 2 ;;
        --server-model) SERVER_MODEL="$2"; shift 2 ;;
        --deploy-config) DEPLOY_CONFIG="$2"; shift 2 ;;
        --server-port) SERVER_PORT="$2"; shift 2 ;;
        --gradio-port) GRADIO_PORT="$2"; shift 2 ;;
        --server-host) SERVER_HOST="$2"; shift 2 ;;
        --gradio-ip) GRADIO_IP="$2"; shift 2 ;;
        --share) GRADIO_SHARE=true; shift ;;
        --help)
            echo "Usage: $0 [--model SERVED_MODEL_NAME] [--server-model MODEL_PATH] [--deploy-config YAML] [--server-port PORT] [--gradio-port PORT] [--share]"
            exit 0
            ;;
        *) echo "Unknown option: $1"; exit 1 ;;
    esac
done
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
API_BASE="http://localhost:${SERVER_PORT}/v1"
LOG_FILE="/tmp/aura_omni_vllm_${SERVER_PORT}.log"
cleanup() {
    echo "Shutting down..."
    [[ -n "${SERVER_PID:-}" ]] && kill "$SERVER_PID" 2>/dev/null || true
    [[ -n "${GRADIO_PID:-}" ]] && kill "$GRADIO_PID" 2>/dev/null || true
}
trap cleanup SIGINT SIGTERM EXIT
vllm serve "$SERVER_MODEL" \
--omni \
--host "$SERVER_HOST" \
--port "$SERVER_PORT" \
--deploy-config "$DEPLOY_CONFIG" \
--served-model-name "$MODEL" \
--trust-remote-code 2>&1 | tee "$LOG_FILE" &
SERVER_PID=$!
echo "Waiting for server startup..."
for _ in $(seq 1 600); do
    if grep -q "Application startup complete" "$LOG_FILE" 2>/dev/null; then
        break
    fi
    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
        echo "vLLM server exited before startup completed"
        wait "$SERVER_PID" || true
        exit 1
    fi
    sleep 1
done
# Run the demo by its full path so the script works from any working directory.
GRADIO_CMD=(python "${SCRIPT_DIR}/gradio_demo.py" --model "$MODEL" --api-base "$API_BASE" --ip "$GRADIO_IP" --port "$GRADIO_PORT")
if [[ "$GRADIO_SHARE" == "true" ]]; then
    GRADIO_CMD+=(--share)
fi
"${GRADIO_CMD[@]}" &
GRADIO_PID=$!
echo "vLLM server: http://${SERVER_HOST}:${SERVER_PORT}"
echo "Gradio demo: http://${GRADIO_IP}:${GRADIO_PORT}"
if [[ -n "${SERVER_PID:-}" ]]; then
    wait "$SERVER_PID" "$GRADIO_PID"
else
    wait "$GRADIO_PID"
fi