Step-Audio2 Online Serving
Source https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/step_audio2.
This directory contains examples for running Step-Audio2 with vLLM-Omni's online serving API.
Installation
Please refer to README.md
Launch the Server
# Async chunk mode (recommended — lower first-packet latency for TTS)
vllm serve stepfun-ai/Step-Audio2-mini --omni --port 8092 \
--stage-configs-path vllm_omni/model_executor/stage_configs/step_audio_2_async_chunk.yaml \
--trust-remote-code --enforce-eager
Sequential mode:
vllm serve stepfun-ai/Step-Audio2-mini --omni --port 8092 \
--stage-configs-path vllm_omni/model_executor/stage_configs/step_audio_2.yaml \
--trust-remote-code --enforce-eager
With local model:
Send Requests
TTS via /v1/audio/speech (Recommended)
cd examples/online_serving/step_audio2
# Python client
python openai_speech_client.py --text "你好世界"
# With custom system prompt
python openai_speech_client.py --text "Hello, how are you?" \
--instructions "You are a friendly assistant."
# Save to specific file
python openai_speech_client.py --text "你好世界" -o output.wav
Or via curl:
curl -X POST http://localhost:8092/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"stepfun-ai/Step-Audio2-mini","input":"你好世界","voice":"default"}' \
--output output.wav
This endpoint bypasses the chat template and directly triggers TTS mode. It supports async chunk streaming for low first-packet latency.
Note: Speaker voice is controlled by STEP_AUDIO2_DEFAULT_PROMPT_WAV env var on the server side.
Chat Completions (ASR / S2ST)
# Audio to Text (ASR)
python openai_chat_completion_client.py --query-type audio_to_text
# Audio to Audio (S2ST)
python openai_chat_completion_client.py --query-type audio_to_audio --audio-path /path/to/input.wav
| Argument | Description |
|---|---|
| --query-type, -q | Query type: audio_to_text, text_to_audio, audio_to_audio |
| --audio-path, -a | Path to input audio file (local or URL) |
| --text, -t | Text to synthesize (for TTS mode) |
| --prompt, -p | Custom prompt/question |
| --output-dir, -o | Output directory for audio files (default: output_online) |
| --api-base | API base URL (default: http://localhost:8092/v1) |
Curl (Chat Completions)
# Audio to Text
bash run_curl.sh audio_to_text
# Text to Audio
bash run_curl.sh text_to_audio
# Audio to Audio
bash run_curl.sh audio_to_audio
Query Types
1. Audio to Text (ASR)
Transcribe audio to text.
python openai_chat_completion_client.py \
--query-type audio_to_text \
--audio-path /path/to/speech.wav \
--prompt "Transcribe this audio."
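When the input is a local file rather than a URL, one common way to place it in an audio_url content part is a base64 data URL. The sketch below illustrates that encoding; whether the bundled client does exactly this is an assumption, and the helper name audio_to_data_url is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def audio_to_data_url(path: str) -> str:
    """Encode a local audio file as a data URL (hypothetical helper)."""
    # Guess the MIME type from the extension; fall back to WAV if unknown.
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("audio/"):
        mime = "audio/wav"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be used as the "url" value of an audio_url content part in place of an http(s) URL.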
2. Text to Audio (TTS)
Convert text to speech:
# Via speech endpoint (recommended, returns WAV directly)
python openai_speech_client.py --text "Hello, welcome to Step-Audio2."
# Via chat completions
python openai_chat_completion_client.py \
--query-type text_to_audio \
--text "Hello, welcome to Step-Audio2."
3. Audio to Audio (S2ST)
Process input audio and generate text transcription + audio output.
python openai_chat_completion_client.py \
--query-type audio_to_audio \
--audio-path /path/to/source.wav
Output
- Text output: Printed to console
- Audio output: Saved to output_online/audio_0.wav (24kHz WAV)
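To confirm that a saved file really is 24kHz WAV, the header can be read with Python's standard wave module. This is a minimal sketch; the function name describe_wav is illustrative and not part of the example scripts.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of a WAV file read from its header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": frames / rate,
        }
```

For example, describe_wav("output_online/audio_0.wav")["sample_rate"] should be 24000 if the output matches the format stated above.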
API Format
Step-Audio2 uses the OpenAI-compatible chat completions API:
{
  "model": "stepfun-ai/Step-Audio2-mini",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "Transcribe the audio."}]
    },
    {
      "role": "user",
      "content": [
        {"type": "audio_url", "audio_url": {"url": "..."}},
        {"type": "text", "text": "Please transcribe."}
      ]
    }
  ],
  "sampling_params_list": [
    {"temperature": 0.7, "max_tokens": 1024},
    {"temperature": 0.0, "max_tokens": 1}
  ]
}
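The same request body can be assembled in Python before sending it to /v1/chat/completions. The sketch below only builds the dictionary; the function name build_chat_request is hypothetical, and the two sampling_params_list entries are taken as corresponding to the Thinker and Token2Wav stages, following the comments in run_curl.sh.

```python
import json

MODEL = "stepfun-ai/Step-Audio2-mini"


def build_chat_request(audio_url: str, prompt: str, system_prompt: str) -> dict:
    """Build a chat completions payload in the format shown above."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": system_prompt}],
            },
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        # One entry per stage: Thinker first, then Token2Wav.
        "sampling_params_list": [
            {"temperature": 0.7, "max_tokens": 1024},
            {"temperature": 0.0, "max_tokens": 1},
        ],
    }


def to_json(payload: dict) -> str:
    """Serialize the payload, keeping non-ASCII text readable."""
    return json.dumps(payload, ensure_ascii=False)
```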
Performance
Async Chunk vs Sequential
Benchmark via /v1/audio/speech (4x RTX 3090, 10 prompts, concurrency=1):
| Mode | Mean TTFP | Mean E2E | Mean RTF |
|---|---|---|---|
| Sequential | 4316ms | 4316ms | 0.938 |
| Async Chunk | 1437ms | 4362ms | 0.949 |
Async chunk reduces TTFP by 67% by streaming audio token chunks from Thinker to Token2Wav as they are generated. RTF < 1 in both modes (real-time capable).
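The 67% figure follows directly from the two mean TTFP values in the table. The sketch below reproduces the arithmetic and also shows the conventional definition of real-time factor (processing time divided by audio duration); that the benchmark computes RTF in exactly this way is an assumption, since the source does not define it.

```python
def ttfp_reduction(baseline_ms: float, new_ms: float) -> float:
    """Fractional reduction in time to first packet relative to the baseline."""
    return (baseline_ms - new_ms) / baseline_ms


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Conventional RTF: below 1.0 means audio is produced faster than it plays."""
    return processing_s / audio_s


# Mean TTFP values from the table above: Sequential vs Async Chunk.
reduction = ttfp_reduction(4316, 1437)
print(f"TTFP reduction: {reduction:.1%}")  # prints "TTFP reduction: 66.7%"
```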
Troubleshooting
Server not responding
- Check if the server is running: curl http://localhost:8092/v1/models
- Verify that the port in the request URL matches the --port passed to vllm serve (8092 in the examples above)
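The same check can be scripted with the standard library. This is a minimal sketch that assumes /v1/models returns the usual OpenAI-style body with a "data" list of objects carrying an "id"; the function names are illustrative.

```python
import http.client
import json
import urllib.error
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model listing URL from a base such as http://localhost:8092."""
    return api_base.rstrip("/") + "/v1/models"


def list_served_models(api_base: str = "http://localhost:8092", timeout: float = 5.0):
    """Return the served model ids, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return None
    if not isinstance(body, dict):
        return None
    # Assumed response shape: {"data": [{"id": "..."}, ...]}
    return [m.get("id") for m in body.get("data", []) if isinstance(m, dict)]
```

A None result points to a connection problem (server down or wrong port), while an empty list means the server answered but reported no models.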
FileNotFoundError: prompt_wav file not found
- Ensure default_female.wav exists at {model_dir}/assets/default_female.wav
- Or set the STEP_AUDIO2_DEFAULT_PROMPT_WAV environment variable when launching the server
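Both conditions can be checked before launching the server. The sketch below looks for the prompt WAV in the two places named above; the lookup order (environment variable first, then the model directory) is an assumption for illustration, not a description of the server's actual logic, and find_prompt_wav is a hypothetical helper.

```python
import os
from pathlib import Path

ENV_VAR = "STEP_AUDIO2_DEFAULT_PROMPT_WAV"


def find_prompt_wav(model_dir, env=None):
    """Return the first existing prompt WAV path, or None if neither exists."""
    env = os.environ if env is None else env
    candidates = []
    override = env.get(ENV_VAR)
    if override:
        # Assumed precedence: an explicit override is checked first.
        candidates.append(Path(override))
    candidates.append(Path(model_dir) / "assets" / "default_female.wav")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```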
Audio not generated
- For TTS, use the /v1/audio/speech endpoint (recommended) or openai_speech_client.py
- For chat completions TTS, ensure the prompt ends with <tts_start>
- Check server logs for errors
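For the chat completions path, a small guard can make sure the user text carries the trailing <tts_start> marker, as the text_to_audio and audio_to_audio cases in run_curl.sh do. This is a sketch; the helper name with_tts_trigger is hypothetical.

```python
TTS_START = "<tts_start>"


def with_tts_trigger(prompt: str) -> str:
    """Return the prompt with exactly one trailing <tts_start> marker."""
    stripped = prompt.rstrip()
    if stripped.endswith(TTS_START):
        return stripped
    return stripped + TTS_START
```

Calling the helper twice gives the same result as calling it once, so it is safe to apply to prompts that may already include the marker.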
Out of memory
- Reduce gpu_memory_utilization in stage configs
- Use a smaller batch size
Example materials
openai_chat_completion_client.py
Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/online_serving/step_audio2/openai_chat_completion_client.py.
openai_speech_client.py
"""OpenAI-compatible client for Step-Audio2 TTS via /v1/audio/speech endpoint.

Examples:
    # Basic TTS
    python openai_speech_client.py --text "你好世界"

    # With custom system prompt
    python openai_speech_client.py --text "Hello, how are you?" \
        --instructions "You are a friendly assistant."

    # Save to specific file
    python openai_speech_client.py --text "你好世界" -o output.wav
"""

import argparse

import httpx

DEFAULT_API_BASE = "http://localhost:8092"
DEFAULT_API_KEY = "EMPTY"


def run_tts_generation(args) -> None:
    """Run TTS generation via /v1/audio/speech API."""
    payload = {
        "model": args.model,
        "input": args.text,
        "voice": args.voice,
        "response_format": args.response_format,
    }
    if args.instructions:
        payload["instructions"] = args.instructions

    print(f"Model: {args.model}")
    print(f"Text: {args.text}")
    print(f"Voice: {args.voice}")
    print("Generating audio...")

    api_url = f"{args.api_base}/v1/audio/speech"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {args.api_key}",
    }

    with httpx.Client(timeout=300.0) as client:
        response = client.post(api_url, json=payload, headers=headers)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return

    # A 200 response may still carry a JSON error body instead of audio bytes.
    try:
        text = response.content.decode("utf-8")
        if text.startswith('{"error"'):
            print(f"Error: {text}")
            return
    except UnicodeDecodeError:
        # Binary audio is not valid UTF-8, so this is the normal success path.
        pass

    output_path = args.output or "tts_output.wav"
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Audio saved to: {output_path}")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Step-Audio2 TTS client via /v1/audio/speech",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    parser.add_argument(
        "--api-base",
        type=str,
        default=DEFAULT_API_BASE,
        help=f"API base URL (default: {DEFAULT_API_BASE})",
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        help="API key (default: EMPTY)",
    )
    parser.add_argument(
        "--model",
        "-m",
        type=str,
        default="stepfun-ai/Step-Audio2-mini",
        help="Model name/path",
    )
    parser.add_argument(
        "--text",
        type=str,
        required=True,
        help="Text to synthesize",
    )
    parser.add_argument(
        "--voice",
        type=str,
        default="default",
        help="Voice name (default: default)",
    )
    parser.add_argument(
        "--instructions",
        type=str,
        default=None,
        help="System prompt for the Thinker stage",
    )
    parser.add_argument(
        "--response-format",
        type=str,
        default="wav",
        choices=["wav", "pcm"],
        help="Audio output format (default: wav)",
    )
    parser.add_argument(
        "--output",
        "-o",
        type=str,
        default=None,
        help="Output audio file path (default: tts_output.wav)",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    run_tts_generation(args)
run_curl.sh
#!/usr/bin/env bash
set -euo pipefail
# Step-Audio2 curl client for online serving
# Usage: bash run_curl.sh [audio_to_text|text_to_audio|audio_to_audio]
QUERY_TYPE="${1:-audio_to_text}"
API_BASE="${API_BASE:-http://localhost:8092}"
# Validate query type
if [[ ! "$QUERY_TYPE" =~ ^(audio_to_text|text_to_audio|audio_to_audio)$ ]]; then
echo "Error: Invalid query type '$QUERY_TYPE'"
echo "Usage: $0 [audio_to_text|text_to_audio|audio_to_audio]"
echo " audio_to_text: Speech recognition (ASR)"
echo " text_to_audio: Text-to-speech (TTS)"
echo " audio_to_audio: Voice conversion"
exit 1
fi
SEED=42
# Default test audio URL
MARY_HAD_LAMB_URL="https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/mary_had_lamb.ogg"
# Sampling parameters for Thinker stage
thinker_sampling_params='{
"temperature": 0.7,
"top_p": 0.9,
"top_k": -1,
"max_tokens": 1024,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.05
}'
# Sampling parameters for Token2Wav stage
token2wav_sampling_params='{
"temperature": 0.0,
"top_p": 1.0,
"top_k": -1,
"max_tokens": 1,
"seed": 42,
"detokenize": false
}'
# Build request based on query type
case "$QUERY_TYPE" in
audio_to_text)
system_content='[{"type": "text", "text": "You are a speech recognition assistant. Transcribe the audio accurately."}]'
user_content='[
{"type": "audio_url", "audio_url": {"url": "'"$MARY_HAD_LAMB_URL"'"}},
{"type": "text", "text": "Please transcribe this audio."}
]'
# Add stop token for ASR
thinker_sampling_params='{
"temperature": 0.7,
"top_p": 0.9,
"top_k": -1,
"max_tokens": 1024,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.05,
"stop_token_ids": [151645]
}'
;;
text_to_audio)
system_content='[{"type": "text", "text": "You are a text-to-speech assistant. Read the text aloud exactly as provided."}]'
user_content='[
{"type": "text", "text": "Hello, this is a test of Step Audio 2 text to speech synthesis.<tts_start>"}
]'
thinker_sampling_params='{
"temperature": 0.7,
"top_p": 0.9,
"top_k": -1,
"max_tokens": 1024,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.1
}'
;;
audio_to_audio)
system_content='[{"type": "text", "text": "You are an audio processing assistant. Listen and repeat the audio content."}]'
user_content='[
{"type": "audio_url", "audio_url": {"url": "'"$MARY_HAD_LAMB_URL"'"}},
{"type": "text", "text": "Please listen to this audio and repeat its content.<tts_start>"}
]'
thinker_sampling_params='{
"temperature": 0.7,
"top_p": 0.9,
"top_k": -1,
"max_tokens": 1024,
"seed": 42,
"detokenize": true,
"repetition_penalty": 1.1
}'
;;
esac
sampling_params_list='[
'"$thinker_sampling_params"',
'"$token2wav_sampling_params"'
]'
echo "Query type: $QUERY_TYPE"
echo "API base: $API_BASE"
echo "Sending request..."
echo ""
output=$(curl -sS -X POST "${API_BASE}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"model": "stepfun-ai/Step-Audio2-mini",
"sampling_params_list": $sampling_params_list,
"messages": [
{
"role": "system",
"content": $system_content
},
{
"role": "user",
"content": $user_content
}
]
}
EOF
)
# Extract and display text content
text_content=$(echo "$output" | jq -r '.choices[0].message.content // empty')
if [[ -n "$text_content" ]]; then
echo "Text output: $text_content"
fi
# Check for audio content
audio_data=$(echo "$output" | jq -r '.choices[0].message.audio.data // empty')
if [[ -n "$audio_data" ]]; then
echo "Audio output received (base64 encoded)"
echo "To save audio, use the Python client or decode the base64 data"
fi
# Check for errors
error=$(echo "$output" | jq -r '.error // empty')
if [[ -n "$error" ]]; then
echo "Error: $error"
fi
echo ""
echo "Done!"
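The script above reports base64 audio but does not save it. A minimal Python sketch for decoding it is shown below, reading the same field the script's jq expression reads (.choices[0].message.audio.data). The function name is hypothetical, and the container format of the decoded bytes is assumed to be whatever the server produced (WAV in the examples above).

```python
import base64


def save_audio_from_response(response: dict, output_path: str) -> bool:
    """Decode base64 audio from a chat completions response and write it to disk.

    Returns True if audio was found and written, False otherwise.
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    audio = (choices[0].get("message") or {}).get("audio") or {}
    data = audio.get("data")
    if not data:
        return False
    with open(output_path, "wb") as f:
        f.write(base64.b64decode(data))
    return True
```

With the JSON response parsed into a dictionary (for example via json.loads on the curl output), save_audio_from_response(parsed, "audio_0.wav") writes the file when audio is present.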