Step-Audio2 Online Serving

Source: https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/step_audio2

This directory contains examples for running Step-Audio2 with vLLM-Omni's online serving API.

Installation

Please refer to README.md.

Launch the Server

Async chunk mode (recommended for TTS because of its lower first-packet latency):
vllm serve stepfun-ai/Step-Audio2-mini --omni --port 8092 \
    --stage-configs-path vllm_omni/model_executor/stage_configs/step_audio_2_async_chunk.yaml \
    --trust-remote-code --enforce-eager

Sequential mode:

vllm serve stepfun-ai/Step-Audio2-mini --omni --port 8092 \
    --stage-configs-path vllm_omni/model_executor/stage_configs/step_audio_2.yaml \
    --trust-remote-code --enforce-eager

With a local model:

vllm serve /path/to/Step-Audio-2-mini --omni --port 8092 \
    --trust-remote-code --enforce-eager

Send Requests

TTS via `/v1/audio/speech` (Recommended)

cd examples/online_serving/step_audio2

# Python client
python openai_speech_client.py --text "你好世界"

# With custom system prompt
python openai_speech_client.py --text "Hello, how are you?" \
    --instructions "You are a friendly assistant."

# Save to specific file
python openai_speech_client.py --text "你好世界" -o output.wav

Or via curl:

curl -X POST http://localhost:8092/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"stepfun-ai/Step-Audio2-mini","input":"你好世界","voice":"default"}' \
    --output output.wav

This endpoint bypasses the chat template and directly triggers TTS mode. It supports async chunk streaming for low first-packet latency.

Note: The speaker voice is controlled by the `STEP_AUDIO2_DEFAULT_PROMPT_WAV` environment variable on the server side.
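For readers who would rather not depend on httpx, the same request can be sketched with only the Python standard library. The helper names `build_speech_request` and `synthesize` are illustrative and not part of the example scripts; the payload fields mirror the curl call above, and the server address is assumed to be the default http://localhost:8092.

```python
import json
import urllib.request

API_BASE = "http://localhost:8092"  # assumption: server launched on the default port


def build_speech_request(text, model="stepfun-ai/Step-Audio2-mini",
                         voice="default", api_base=API_BASE):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{api_base}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, output_path="output.wav", timeout=300.0):
    """Send the request and write the returned audio bytes to output_path."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(output_path, "wb") as f:
        f.write(audio)
    return output_path
```

With the server running, `synthesize("你好世界")` should write the returned audio to output.wav, much like the curl command.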

Chat Completions (ASR / S2ST)

# Audio to Text (ASR)
python openai_chat_completion_client.py --query-type audio_to_text

# Audio to Audio (S2ST)
python openai_chat_completion_client.py --query-type audio_to_audio --audio-path /path/to/input.wav

| Argument | Description |
| --- | --- |
| `--query-type`, `-q` | Query type: `audio_to_text`, `text_to_audio`, or `audio_to_audio` |
| `--audio-path`, `-a` | Path to input audio file (local or URL) |
| `--text`, `-t` | Text to synthesize (for TTS mode) |
| `--prompt`, `-p` | Custom prompt/question |
| `--output-dir`, `-o` | Output directory for audio files (default: `output_online`) |
| `--api-base` | API base URL (default: `http://localhost:8092/v1`) |

Curl (Chat Completions)

# Audio to Text
bash run_curl.sh audio_to_text

# Text to Audio
bash run_curl.sh text_to_audio

# Audio to Audio
bash run_curl.sh audio_to_audio
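run_curl.sh reports when base64-encoded audio is present in the response but does not save it. A minimal sketch of decoding it in Python follows, assuming the audio sits at `choices[0].message.audio.data` (the path the script queries with jq). The function name `extract_audio_bytes` is hypothetical.

```python
import base64
import json


def extract_audio_bytes(response_json):
    """Return decoded audio bytes from a chat completions response, or None.

    response_json is the raw JSON text returned by /v1/chat/completions.
    """
    response = json.loads(response_json)
    choices = response.get("choices") or [{}]
    message = choices[0].get("message") or {}
    audio = message.get("audio") or {}
    data = audio.get("data")
    if not data:
        return None  # text-only reply or an error payload
    return base64.b64decode(data)
```

The returned bytes can then be written to a file opened in binary mode, for example `open("audio_0.wav", "wb").write(audio_bytes)`.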

Query Types

1. Audio to Text (ASR)

Transcribe audio to text.

python openai_chat_completion_client.py \
    --query-type audio_to_text \
    --audio-path /path/to/speech.wav \
    --prompt "Transcribe this audio."

2. Text to Audio (TTS)

Convert text to speech:

# Via speech endpoint (recommended, returns WAV directly)
python openai_speech_client.py --text "Hello, welcome to Step-Audio2."

# Via chat completions
python openai_chat_completion_client.py \
    --query-type text_to_audio \
    --text "Hello, welcome to Step-Audio2."

3. Audio to Audio (S2ST)

Process input audio and generate both a text transcription and audio output.

python openai_chat_completion_client.py \
    --query-type audio_to_audio \
    --audio-path /path/to/source.wav

Output

Text output: printed to the console
Audio output: saved to output_online/audio_0.wav (24 kHz WAV)
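To confirm that a saved file really is a 24 kHz WAV, its header can be inspected with the standard `wave` module. This is a small sketch and `wav_info` is an illustrative helper, not part of the example clients.

```python
import io
import wave


def wav_info(wav_bytes):
    """Return (sample_rate_hz, channels, duration_seconds) for WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate
```

For example, `wav_info(open("output_online/audio_0.wav", "rb").read())` should report a sample rate of 24000 for audio produced by the clients above.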

API Format

Step-Audio2 uses the OpenAI-compatible chat completions API:

{
  "model": "stepfun-ai/Step-Audio2-mini",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "Transcribe the audio."}]
    },
    {
      "role": "user",
      "content": [
        {"type": "audio_url", "audio_url": {"url": "..."}},
        {"type": "text", "text": "Please transcribe."}
      ]
    }
  ],
  "sampling_params_list": [
    {"temperature": 0.7, "max_tokens": 1024},
    {"temperature": 0.0, "max_tokens": 1}
  ]
}
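The request body above can also be assembled programmatically. The sketch below builds the same structure as a Python dict; `build_chat_request` is a hypothetical helper, and the ordering of `sampling_params_list` (Thinker first, then Token2Wav) follows run_curl.sh.

```python
import json


def build_chat_request(audio_url, system_text, user_text,
                       model="stepfun-ai/Step-Audio2-mini"):
    """Assemble a chat completions payload with per-stage sampling parameters."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_text}]},
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": user_text},
                ],
            },
        ],
        # One entry per stage, in pipeline order: Thinker, then Token2Wav.
        "sampling_params_list": [
            {"temperature": 0.7, "max_tokens": 1024},
            {"temperature": 0.0, "max_tokens": 1},
        ],
    }


def to_json(payload):
    """Serialize the payload for use as an HTTP request body."""
    return json.dumps(payload, ensure_ascii=False)
```

Calling `to_json(build_chat_request(url, "Transcribe the audio.", "Please transcribe."))` yields a body that can be posted to `/v1/chat/completions`, where `url` is any audio URL the server can reach.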

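Performance

Async Chunk vs Sequential

Benchmark via /v1/audio/speech (4x RTX 3090, 10 prompts, concurrency=1). TTFP is time to first packet, E2E is end-to-end latency, and RTF is real-time factor:

| Mode | Mean TTFP | Mean E2E | Mean RTF |
| --- | --- | --- | --- |
| Sequential | 4316 ms | 4316 ms | 0.938 |
| Async Chunk | 1437 ms | 4362 ms | 0.949 |

Async chunk mode reduces mean TTFP by about 67% (4316 ms to 1437 ms) by streaming audio token chunks from the Thinker to Token2Wav as they are generated, while mean E2E latency stays essentially unchanged. RTF is below 1 in both modes, so both can generate audio faster than real time.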
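The 67% figure follows directly from the table. As a quick check, the relative change in mean TTFP and mean E2E can be recomputed from the reported values:

```python
# Mean latencies in milliseconds, copied from the benchmark table.
sequential = {"ttfp_ms": 4316, "e2e_ms": 4316}
async_chunk = {"ttfp_ms": 1437, "e2e_ms": 4362}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means lower)."""
    return (new - old) / old * 100.0


ttfp_change = percent_change(sequential["ttfp_ms"], async_chunk["ttfp_ms"])
e2e_change = percent_change(sequential["e2e_ms"], async_chunk["e2e_ms"])

print(f"TTFP change: {ttfp_change:+.1f}%")  # prints "TTFP change: -66.7%"
print(f"E2E change:  {e2e_change:+.1f}%")   # prints "E2E change:  +1.1%"
```

In other words, async chunk mode trades roughly 1% of end-to-end latency for a first packet that arrives about three times sooner.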

Troubleshooting

Server not responding

Check if the server is running: curl http://localhost:8092/v1/models
Verify the port number matches
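The same liveness check can be scripted, for instance to wait for the server before sending requests. This is a sketch using only the standard library; `models_url` and `server_is_up` are illustrative names, and the default address is assumed to match the launch commands above.

```python
import json
import urllib.request


def models_url(api_base="http://localhost:8092"):
    """URL of the model listing endpoint used as a liveness probe."""
    return f"{api_base.rstrip('/')}/v1/models"


def server_is_up(api_base="http://localhost:8092", timeout=5.0):
    """Return True if GET /v1/models answers with HTTP 200 and valid JSON."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            json.load(resp)
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers connection refused, timeouts, and HTTP errors
        # (urllib.error.URLError subclasses it); ValueError covers bad JSON.
        return False
```

A `False` result usually means the server has not finished loading the model or is listening on a different port.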

FileNotFoundError: prompt_wav file not found

Ensure default_female.wav exists at {model_dir}/assets/default_female.wav
Or set STEP_AUDIO2_DEFAULT_PROMPT_WAV environment variable when launching the server
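Before launching the server, both locations can be checked with a short script. The sketch below is a pre-flight helper, not the server's own lookup logic: `find_prompt_wav` is a hypothetical function, and the assumption that the environment variable is consulted before the bundled file is made only for illustration.

```python
import os

ENV_VAR = "STEP_AUDIO2_DEFAULT_PROMPT_WAV"


def find_prompt_wav(model_dir, environ=os.environ, exists=os.path.isfile):
    """Return the first existing prompt WAV candidate, or None.

    Assumption: the environment variable is checked before the bundled file;
    the server's real lookup order may differ.
    """
    candidates = []
    if environ.get(ENV_VAR):
        candidates.append(environ[ENV_VAR])
    candidates.append(os.path.join(model_dir, "assets", "default_female.wav"))
    for path in candidates:
        if exists(path):
            return path
    return None
```

If `find_prompt_wav("/path/to/Step-Audio-2-mini")` returns `None`, neither location holds a usable file and the server is likely to raise the error above.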

Audio not generated

For TTS, use the /v1/audio/speech endpoint (recommended) or openai_speech_client.py
For chat completions TTS, ensure the prompt ends with <tts_start>
Check server logs for errors
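When building chat completions prompts by hand, it is easy to forget the trailing `<tts_start>` marker. A small guard such as the following (an illustrative helper, not part of the example clients) appends it only when it is missing:

```python
TTS_START = "<tts_start>"


def with_tts_trigger(prompt):
    """Append <tts_start> unless the prompt already ends with it."""
    stripped = prompt.rstrip()
    return stripped if stripped.endswith(TTS_START) else stripped + TTS_START
```

This matches the prompts in run_curl.sh, where the marker directly follows the text to be spoken.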

Out of memory

Reduce gpu_memory_utilization in stage configs
Use a smaller batch size

Example materials

openai_chat_completion_client.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/step_audio2/openai_chat_completion_client.py.

openai_speech_client.py

"""OpenAI-compatible client for Step-Audio2 TTS via /v1/audio/speech endpoint.

Examples:
    # Basic TTS
    python openai_speech_client.py --text "你好世界"

    # With custom system prompt
    python openai_speech_client.py --text "Hello, how are you?" \
        --instructions "You are a friendly assistant."

    # Save to specific file
    python openai_speech_client.py --text "你好世界" -o output.wav
"""

import argparse

import httpx

DEFAULT_API_BASE = "http://localhost:8092"
DEFAULT_API_KEY = "EMPTY"


def run_tts_generation(args) -> None:
    """Run TTS generation via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": args.voice,
        "response_format": args.response_format,
    }

    if args.instructions:
        payload["instructions"] = args.instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice: {args.voice}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        pass

    output_path = args.output or "tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Step-Audio2 TTS client via /v1/audio/speech",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="stepfun-ai/Step-Audio2-mini",
        help="Model name/path",
    )
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )
    parser.add_argument(
        "--voice",
        type=str,
        default="default",
        help="Voice name (default: default)",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="System prompt for the Thinker stage",
    )
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "pcm"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.wav)",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)

run_curl.sh

#!/usr/bin/env bash
set -euo pipefail

# Step-Audio2 curl client for online serving
# Requires: curl, jq
# Usage: bash run_curl.sh [audio_to_text|text_to_audio|audio_to_audio]

QUERY_TYPE="${1:-audio_to_text}"
API_BASE="${API_BASE:-http://localhost:8092}"

# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(audio_to_text|text_to_audio|audio_to_audio)$ ]]; then
    echo "Error: Invalid query type '$QUERY_TYPE'"
    echo "Usage: $0 [audio_to_text|text_to_audio|audio_to_audio]"
    echo "  audio_to_text: Speech recognition (ASR)"
    echo "  text_to_audio: Text-to-speech (TTS)"
    echo "  audio_to_audio: Voice conversion"
    exit 1
fi

SEED=42

# Default test audio URL
MARY_HAD_LAMB_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"

# Sampling parameters for Thinker stage
thinker_sampling_params='{
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": -1,
  "max_tokens": 1024,
  "seed": 42,
  "detokenize": true,
  "repetition_penalty": 1.05
}'

# Sampling parameters for Token2Wav stage
token2wav_sampling_params='{
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "max_tokens": 1,
  "seed": 42,
  "detokenize": false
}'

# Build request based on query type
case "$QUERY_TYPE" in
  audio_to_text)
    system_content='[{"type": "text", "text": "You are a speech recognition assistant. Transcribe the audio accurately."}]'
    user_content='[
      {"type": "audio_url", "audio_url": {"url": "'"$MARY_HAD_LAMB_URL"'"}},
      {"type": "text", "text": "Please transcribe this audio."}
    ]'
    # Add stop token for ASR
    thinker_sampling_params='{
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": -1,
      "max_tokens": 1024,
      "seed": 42,
      "detokenize": true,
      "repetition_penalty": 1.05,
      "stop_token_ids": [151645]
    }'
    ;;
  text_to_audio)
    system_content='[{"type": "text", "text": "You are a text-to-speech assistant. Read the text aloud exactly as provided."}]'
    user_content='[
      {"type": "text", "text": "Hello, this is a test of Step Audio 2 text to speech synthesis.<tts_start>"}
    ]'
    thinker_sampling_params='{
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": -1,
      "max_tokens": 1024,
      "seed": 42,
      "detokenize": true,
      "repetition_penalty": 1.1
    }'
    ;;
  audio_to_audio)
    system_content='[{"type": "text", "text": "You are an audio processing assistant. Listen and repeat the audio content."}]'
    user_content='[
      {"type": "audio_url", "audio_url": {"url": "'"$MARY_HAD_LAMB_URL"'"}},
      {"type": "text", "text": "Please listen to this audio and repeat its content.<tts_start>"}
    ]'
    thinker_sampling_params='{
      "temperature": 0.7,
      "top_p": 0.9,
      "top_k": -1,
      "max_tokens": 1024,
      "seed": 42,
      "detokenize": true,
      "repetition_penalty": 1.1
    }'
    ;;
esac

sampling_params_list='[
  '"$thinker_sampling_params"',
  '"$token2wav_sampling_params"'
]'

echo "Query type: $QUERY_TYPE"
echo "API base: $API_BASE"
echo "Sending request..."
echo ""

output=$(curl -sS -X POST "${API_BASE}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
{
  "model": "stepfun-ai/Step-Audio2-mini",
  "sampling_params_list": $sampling_params_list,
  "messages": [
    {
      "role": "system",
      "content": $system_content
    },
    {
      "role": "user",
      "content": $user_content
    }
  ]
}
EOF
)

# Extract and display text content
text_content=$(echo "$output" | jq -r '.choices[0].message.content // empty')
if [[ -n "$text_content" ]]; then
    echo "Text output: $text_content"
fi

# Check for audio content
audio_data=$(echo "$output" | jq -r '.choices[0].message.audio.data // empty')
if [[ -n "$audio_data" ]]; then
    echo "Audio output received (base64 encoded)"
    echo "To save audio, use the Python client or decode the base64 data"
fi

# Check for errors
error=$(echo "$output" | jq -r '.error // empty')
if [[ -n "$error" ]]; then
    echo "Error: $error"
fi

echo ""
echo "Done!"