Step-Audio2 Offline Inference Examples¶

Source https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/step_audio2.

This directory contains examples for running offline inference with Step-Audio2 using vLLM-Omni.

Model Overview¶

Step-Audio2 is a two-stage audio model:

Stage 0 (Thinker): Audio understanding → Text + Audio tokens
Input: Audio (16kHz)
Output: Text transcription + Audio tokens for synthesis
Stage 1 (Token2Wav): Audio synthesis
Input: Audio tokens + Speaker prompt wav
Output: Synthesized audio waveform (24kHz)

Hardware Requirements¶

Mode	GPU Configuration	VRAM Required
ASR (S2T)	1x GPU	~20-25GB
TTS/S2ST (single GPU)	1x GPU	~40-50GB
TTS/S2ST (multi GPU)	2x GPU	GPU0: ~28GB, GPU1: ~22GB

Tested on:
- 1x NVIDIA H100 80GB (single-card S2ST)
- 2x NVIDIA A10 40GB (multi-card S2ST)

Notes:
- Single GPU mode requires high VRAM because both stages share the same device memory
- Multi GPU mode separates Stage 0 (Thinker) and Stage 1 (Token2Wav) across GPUs
- VRAM usage can be adjusted via gpu_memory_utilization in the stage config

Performance Benchmark¶

vLLM-Omni vs Official Step-Audio2¶

Single-request latency comparison between vLLM-Omni and the official Step-Audio2 implementation.

Task	Tokens	vllm-omni	Step-Audio2	Speedup
S2ST	~85	5.45s	7.36s	1.35x
S2ST	~160	6.67s	13.92s	2.09x
S2ST	~315	9.42s	31.50s	3.34x
TTS	~1024	16.65s	~87s	~5.2x

Key observations:
- Speedup increases with sequence length due to vLLM's efficient KV cache management
- TTS (pure generation) shows the largest speedup (~5x)
- S2ST benefits from the optimized multi-stage pipeline

Benchmark environment:
- GPU: NVIDIA H100 80GB (single card)
- Model: Step-Audio2-mini
- Warmup: 1 run, Measured: 3 runs (averaged)
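The Speedup column in the comparison table is the official latency divided by the vLLM-Omni latency. As a minimal sketch, the snippet below recomputes it from the numbers in the table; the `RESULTS` list and the `speedup` helper are illustrative names, not part of vLLM-Omni.

```python
# Latencies in seconds, copied from the comparison table:
# (task, approximate token count, vLLM-Omni latency, official latency).
RESULTS = [
    ("S2ST", 85, 5.45, 7.36),
    ("S2ST", 160, 6.67, 13.92),
    ("S2ST", 315, 9.42, 31.50),
    ("TTS", 1024, 16.65, 87.0),  # the official TTS latency is approximate (~87s)
]


def speedup(omni_s: float, official_s: float) -> float:
    """Return how many times faster vLLM-Omni is than the official implementation."""
    return official_s / omni_s


for task, tokens, omni_s, official_s in RESULTS:
    print(f"{task:5s} ~{tokens:<5d} {speedup(omni_s, official_s):.2f}x")
```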

Async Chunk Streaming Performance¶

Comparison between sequential (non-async) and async chunk modes via the /v1/audio/speech TTS endpoint.

Mode	Mean TTFP	Mean E2E	Mean RTF	Audio Throughput
Sequential	4316ms	4316ms	0.938	1.07x realtime
Async Chunk	1437ms	4362ms	0.949	1.06x realtime
Improvement	-67% (3x faster)	~same	~same	~same

Key observations:
- Async chunk reduces time-to-first-audio (TTFP) by 67% (4.3s → 1.4s)
- E2E latency remains comparable: async chunk overlaps Thinker decode with Token2Wav synthesis
- RTF < 1 in both modes (real-time capable)
- Sequential mode: TTFP ≈ E2E (must wait for all audio tokens before synthesis starts)
- Async chunk mode: Token2Wav starts after the first 28 tokens (chunk_size=25 + lookahead=3)

Benchmark environment:
- GPU: 4x NVIDIA RTX 3090 24GB (TP=2 for Thinker, 1 GPU for Token2Wav)
- Model: Step-Audio2-mini
- Endpoint: /v1/audio/speech (10 prompts, concurrency=1)
- Measured via bench_tts_serve.py
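The headline figures for async chunk mode follow directly from the table. A small sketch of the arithmetic, using hypothetical helper names (`ttfp_reduction`, `first_chunk_tokens`) that are not part of the benchmark script:

```python
def ttfp_reduction(sequential_ms: float, async_ms: float) -> float:
    """Fraction by which async chunk mode lowers time-to-first-audio."""
    return (sequential_ms - async_ms) / sequential_ms


def first_chunk_tokens(chunk_size: int, lookahead: int) -> int:
    """Audio tokens the Thinker must emit before Token2Wav can start."""
    return chunk_size + lookahead


# Mean TTFP values from the table, in milliseconds.
print(f"TTFP reduction: {ttfp_reduction(4316, 1437):.0%}")
print(f"TTFP speedup:   {4316 / 1437:.1f}x")
print(f"Tokens before first synthesis: {first_chunk_tokens(25, 3)}")
```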

Installation¶

Make sure you have installed vLLM-Omni and all required dependencies:

# Install vLLM-Omni
pip install vllm-omni

# Install Step-Audio2 (REQUIRED for Token2Wav stage)
pip install step-audio2

Model Setup¶

Option 1: Auto-download from HuggingFace (Recommended)¶

The script will automatically download the model on first run:

# Run without --model to auto-download stepfun-ai/Step-Audio2-mini
python end2end.py --query-type audio_to_text

# Or explicitly specify the HuggingFace model
python end2end.py --query-type audio_to_text --model stepfun-ai/Step-Audio2-mini

Models will be cached in ~/.cache/huggingface/hub/ for future use.

Available models:
- stepfun-ai/Step-Audio2-mini (smaller, faster)
- stepfun-ai/Step-Audio2-7B (larger, better quality)

Option 2: Manual Download (for offline use)¶

Download and use locally:

# Download from HuggingFace
huggingface-cli download stepfun-ai/Step-Audio2-mini --local-dir ./models/Step-Audio2-mini

# Then use the local path
python end2end.py --query-type audio_to_text --model ./models/Step-Audio2-mini

Ensure the model directory contains:

Step-Audio2-mini/
├── config.json
├── model.safetensors (or pytorch_model.bin)
├── tokenizer.json
├── tokenizer_config.json
└── token2wav/                           # Token2Wav models (REQUIRED)
    ├── speech_tokenizer_v2_25hz.onnx   # Audio tokenizer
    ├── campplus.onnx                    # Speaker encoder
    ├── flow.yaml                        # Flow model config
    ├── flow.pt                          # Flow model weights
    └── hift.pt                          # HiFT vocoder weights
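Before launching a run against a local copy, it can help to confirm that the layout above is complete. The following is a minimal sketch using only the standard library; `missing_model_files` is a hypothetical helper, and it assumes a single-file checkpoint as shown in the tree (a sharded checkpoint would be reported as missing by this simple check).

```python
from pathlib import Path

# Files listed in the directory tree above, relative to the model directory.
REQUIRED_FILES = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "token2wav/speech_tokenizer_v2_25hz.onnx",
    "token2wav/campplus.onnx",
    "token2wav/flow.yaml",
    "token2wav/flow.pt",
    "token2wav/hift.pt",
]
# Either one of these weight files is sufficient.
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]


def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


for name in missing_model_files("./models/Step-Audio2-mini"):
    print("missing:", name)
```

An empty result means every file from the tree was found.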

Usage Examples¶

1. Audio to Text (ASR - Speech Recognition)¶

Transcribe audio to text:

# Quick start - Using default model and test audio
python end2end.py --query-type audio_to_text

# Using your own audio file (model will auto-download)
python end2end.py --query-type audio_to_text \
    --audio-path /path/to/input.wav

# With specific model
python end2end.py --query-type audio_to_text \
    --audio-path /path/to/input.wav \
    --model stepfun-ai/Step-Audio2-7B

# With custom question
python end2end.py --query-type audio_to_text \
    --audio-path input.wav \
    --question "What is the speaker saying?"

Output: Text transcription saved to output_step_audio2/00000_text.txt

2. Text to Audio (TTS - Speech Synthesis)¶

Convert text to speech:

# Basic TTS (model auto-downloads)
python end2end.py --query-type text_to_audio \
    --text "Hello, this is a test of Step Audio 2 synthesis."

# With specific model
python end2end.py --query-type text_to_audio \
    --text "Hello, this is a test." \
    --model stepfun-ai/Step-Audio2-7B

Note: Speaker voice is controlled by the STEP_AUDIO2_DEFAULT_PROMPT_WAV environment variable or the default prompt wav bundled with the model.

Output:
- Text: output_step_audio2/00000_text.txt
- Audio: output_step_audio2/00000_output.wav (24kHz)

3. Audio to Audio (Voice Conversion)¶

Process input audio and generate output audio:

# Basic voice conversion (model auto-downloads)
python end2end.py --query-type audio_to_audio \
    --audio-path /path/to/source_audio.wav

# With specific model
python end2end.py --query-type audio_to_audio \
    --audio-path source.wav \
    --model stepfun-ai/Step-Audio2-7B

This mode:
1. Understands the content in --audio-path (source)
2. Generates audio output with the default voice

Note: To use a custom speaker voice, set the STEP_AUDIO2_DEFAULT_PROMPT_WAV environment variable.

Advanced Options¶

# Use custom stage configuration
python end2end.py --query-type audio_to_text \
    --stage-configs-path /path/to/custom_config.yaml

# Multiple prompts (for batch testing)
python end2end.py --query-type audio_to_text \
    --audio-path input.wav \
    --num-prompts 5

# Custom output directory
python end2end.py --query-type text_to_audio \
    --text "Test synthesis" \
    --output-dir ./my_outputs

# Enable detailed logging
python end2end.py --query-type audio_to_text \
    --audio-path input.wav \
    --enable-stats

# Adjust generation parameters
python end2end.py --query-type audio_to_text \
    --audio-path input.wav \
    --max-tokens 2048

# Use custom speaker voice via environment variable
STEP_AUDIO2_DEFAULT_PROMPT_WAV=/path/to/speaker.wav python end2end.py \
    --query-type text_to_audio \
    --text "Hello world"

# Use Ray backend for distributed processing
python end2end.py --query-type text_to_audio \
    --text "Hello world" \
    --worker-backend ray \
    --ray-address "auto"
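When these options are driven from another script, the command line can be assembled programmatically. The sketch below is one possible approach: `build_command` is a hypothetical helper that turns keyword arguments into the flags shown above, and it only knows the flags that you pass to it.

```python
import shlex


def build_command(query_type, **options):
    """Build an argv list for end2end.py.

    Keyword names map to flags (output_dir -> --output-dir). True adds a bare
    flag, None or False skips the option, anything else is passed as a value.
    """
    argv = ["python", "end2end.py", "--query-type", query_type]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is None or value is False:
            continue
        else:
            argv.extend([flag, str(value)])
    return argv


cmd = build_command("audio_to_text", audio_path="input.wav", num_prompts=5, enable_stats=True)
print(shlex.join(cmd))
# To execute it, something like subprocess.run(cmd, check=True) can be used;
# a custom speaker voice would go into the env= mapping passed to subprocess.run.
```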

Configuration¶

Stage Configuration¶

The default configuration (step_audio_2.yaml) uses:

Stage 0 (Thinker): GPU 0, 80% memory
Stage 1 (Token2Wav): GPU 1, 30% memory

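For a single-GPU setup, edit the config to use devices: "0" for both stages and lower gpu_memory_utilization so that both stages fit on the one device (the default 80% and 30% shares add up to more than 100%).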
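A quick sanity check for a hand-edited config is to add up the memory fractions requested per device. The sketch below assumes that each stage's gpu_memory_utilization is a fraction of the total memory of the device it runs on; the list-of-dicts layout is purely illustrative and is not the actual schema of step_audio_2.yaml.

```python
from collections import defaultdict


def oversubscribed_devices(stages):
    """Return {device: total_fraction} for devices whose stages request more than 100%."""
    totals = defaultdict(float)
    for stage in stages:
        totals[stage["devices"]] += stage["gpu_memory_utilization"]
    return {device: total for device, total in totals.items() if total > 1.0}


# Default layout described above: one stage per GPU.
default_stages = [
    {"name": "thinker", "devices": "0", "gpu_memory_utilization": 0.8},
    {"name": "token2wav", "devices": "1", "gpu_memory_utilization": 0.3},
]
# Same fractions, but both stages moved to GPU 0.
single_gpu_stages = [dict(stage, devices="0") for stage in default_stages]

print(oversubscribed_devices(default_stages))     # no device is over budget
print(oversubscribed_devices(single_gpu_stages))  # GPU "0" is over budget
```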

Sampling Parameters¶

Thinker (Stage 0):
Temperature: 0.7 (balanced creativity)
Top-p: 0.9
Max tokens: 1024 (configurable)
Token2Wav (Stage 1):
Temperature: 0.0 (deterministic)
Operates in generation mode (not sampling)

Common Issues¶

1. ImportError: No module named 's3tokenizer'¶

Solution: Install the Step-Audio2 package:

pip install step-audio2

2. FileNotFoundError: prompt_wav file not found¶

Solution: Set the STEP_AUDIO2_DEFAULT_PROMPT_WAV environment variable to a valid audio file:

export STEP_AUDIO2_DEFAULT_PROMPT_WAV=/path/to/speaker.wav
python end2end.py --query-type text_to_audio --text "Hello"

Or ensure the default prompt wav (default_female.wav) exists in your model directory.

3. FileNotFoundError: token2wav models not found¶

Solution: Ensure your model directory has the complete token2wav/ subdirectory with all ONNX and PyTorch models.

4. CUDA Out of Memory¶

Solutions:
- If both stages share one GPU, split them across two GPUs (the default multi-GPU config), since single-GPU mode needs roughly 40-50GB of VRAM
- Reduce gpu_memory_utilization in the stage config
- Reduce max_num_batched_tokens
- Process fewer prompts at once

5. Model not found in registry¶

Solution: Launch through vLLM-Omni's entry point (with the --omni flag), and make sure vllm-omni is installed correctly:

pip install vllm-omni

Output Files¶

The script generates files in the output directory (default: output_step_audio2/):

output_step_audio2/
├── 00000_text.txt        # Text output from Thinker stage
├── 00000_output.wav      # Audio output from Token2Wav stage (24kHz)
├── 00001_text.txt        # (if multiple prompts)
└── 00001_output.wav
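For post-processing, the text and audio files can be paired by their numeric prefix. This is a minimal sketch using the standard library; `collect_outputs` is a hypothetical helper, and reading the header with the `wave` module assumes the WAV files are PCM-encoded.

```python
import wave
from pathlib import Path


def collect_outputs(output_dir):
    """Pair each NNNNN_text.txt with its NNNNN_output.wav, if one exists."""
    results = []
    for text_path in sorted(Path(output_dir).glob("*_text.txt")):
        index = text_path.name.split("_")[0]
        entry = {
            "index": index,
            "text": text_path.read_text(encoding="utf-8").strip(),
            "sample_rate": None,  # stays None for text-only (ASR) runs
            "seconds": None,
        }
        wav_path = text_path.with_name(f"{index}_output.wav")
        if wav_path.is_file():
            with wave.open(str(wav_path), "rb") as wav:
                entry["sample_rate"] = wav.getframerate()
                entry["seconds"] = wav.getnframes() / wav.getframerate()
        results.append(entry)
    return results


for entry in collect_outputs("output_step_audio2"):
    print(entry)
```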

Performance Tips¶

First run is slow: Stage initialization takes 20-60 seconds
Single GPU: Set both stages to devices: "0" in config
Multiple prompts: Use --num-prompts N for batch testing
Ray backend: For multi-node or advanced scheduling
Logging: Use --enable-stats to debug performance issues

Speaker Voice Configuration¶

The Token2Wav stage requires a speaker prompt wav for voice conditioning. It is automatically resolved in this order:

STEP_AUDIO2_DEFAULT_PROMPT_WAV environment variable (if set)
{model_dir}/assets/default_female.wav
{model_dir}/default_female.wav

If none are found, set the environment variable explicitly:

export STEP_AUDIO2_DEFAULT_PROMPT_WAV=/path/to/speaker.wav
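The lookup order above can be sketched as a small function. This is an illustration of the documented order only, not the code vLLM-Omni actually runs; `resolve_prompt_wav` is a hypothetical name, and the sketch returns the environment variable's value without checking that the file exists.

```python
import os
from pathlib import Path

ENV_VAR = "STEP_AUDIO2_DEFAULT_PROMPT_WAV"


def resolve_prompt_wav(model_dir, environ=None):
    """Return the speaker prompt wav, following the documented lookup order."""
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR)
    if override:
        # 1. The environment variable wins whenever it is set.
        return Path(override)
    candidates = (
        Path(model_dir) / "assets" / "default_female.wav",  # 2.
        Path(model_dir) / "default_female.wav",             # 3.
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No prompt wav found; set {ENV_VAR} to a valid audio file")


try:
    print(resolve_prompt_wav("./models/Step-Audio2-mini"))
except FileNotFoundError as exc:
    print(exc)
```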

Guidelines for custom speaker prompt:

Duration: 3-10 seconds recommended
Quality: Clean audio, minimal background noise
Format: WAV, MP3, FLAC (will be resampled internally)
Content: Clear speech, representative of target voice
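The duration guideline is easy to check before a run. A minimal sketch for WAV prompts, using the standard-library `wave` module (which reads PCM WAV only; MP3 or FLAC prompts would need a different reader). Both helper names are hypothetical.

```python
import wave


def prompt_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds, low=3.0, high=10.0):
    """True if the duration falls inside the recommended 3-10 second range."""
    return low <= seconds <= high


# Example with a known duration; in practice pass
# prompt_duration_seconds("/path/to/speaker.wav") instead.
print(is_recommended_length(5.0))
```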

Example Workflow¶

Complete example from audio to final output:

# 1. ASR: Transcribe audio
python end2end.py --query-type audio_to_text \
    --audio-path interview.wav \
    --model ./models/Step-Audio2-7B \
    --output-dir ./outputs

# 2. Check the transcription
cat ./outputs/00000_text.txt

# 3. TTS: Synthesize new speech (with custom voice)
STEP_AUDIO2_DEFAULT_PROMPT_WAV=./speaker_samples/female_voice.wav \
python end2end.py --query-type text_to_audio \
    --text "The quick brown fox jumps over the lazy dog" \
    --model ./models/Step-Audio2-7B \
    --output-dir ./outputs

# 4. Listen to the result
# Audio saved to: ./outputs/00000_output.wav

References¶

Example materials¶

end2end.py

Large file omitted from the rendered docs. View it on GitHub: https://github.com/vllm-project/vllm-omni/blob/main/examples/offline_inference/step_audio2/end2end.py.