Speech-to-Text (STT)

vllm-metal supports OpenAI-compatible Speech-to-Text using Whisper and Qwen3-ASR models, running natively on Apple Silicon via MLX.

Installation

First, install vllm-metal using the install script (see the Installation guide):

./install.sh

Then install the optional STT dependencies inside the virtual environment:

source .venv-vllm-metal/bin/activate
pip install 'vllm-metal[stt]'

ffmpeg (Optional)

ffmpeg is only needed for non-WAV audio formats (mp3, m4a, flac, etc.):

# macOS
brew install ffmpeg

# Not required for WAV files - librosa handles those directly
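The rule above (WAV needs nothing extra, other formats need ffmpeg) can be checked before a file is sent. The following is a minimal sketch, not part of vllm-metal: the helper names `needs_ffmpeg` and `check_audio_support` are hypothetical, and deciding by file extension is an assumption that breaks down for misnamed files.

```python
import shutil
from pathlib import Path


def needs_ffmpeg(audio_path: str) -> bool:
    """Return True for any extension other than .wav (assumption: the extension is accurate)."""
    return Path(audio_path).suffix.lower() != ".wav"


def check_audio_support(audio_path: str) -> None:
    """Raise early if the file needs ffmpeg and no ffmpeg binary is on PATH."""
    if needs_ffmpeg(audio_path) and shutil.which("ffmpeg") is None:
        raise RuntimeError(
            f"{audio_path} is not a WAV file and ffmpeg was not found; "
            "install it with 'brew install ffmpeg'"
        )
```

Calling `check_audio_support("meeting.mp3")` on a machine without ffmpeg raises a `RuntimeError`, while any `.wav` path passes without looking for ffmpeg at all.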

Quick Start

# Start server with a Whisper model
vllm serve openai/whisper-small --port 8000

# Transcribe audio (audio.wav is a placeholder for your own file)
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav"
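The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the Quick Start is listening on `http://localhost:8000`. The helper names `build_multipart` and `transcribe` are illustrative, not part of vllm-metal, and the code assumes the default `json` response format.

```python
import json
import urllib.request
import uuid


def build_multipart(fields: dict, filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one part named "file" as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path: str, base_url: str = "http://localhost:8000") -> str:
    """POST an audio file to /v1/audio/transcriptions and return the "text" field."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart({}, audio_path.rsplit("/", 1)[-1], audio)
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

With the server running, `print(transcribe("audio.wav"))` would print the transcript. Any OpenAI-compatible client library should work equally well; the standard-library version is shown only to keep the example dependency-free.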

Supported Models

Whisper

Any OpenAI Whisper checkpoint (HuggingFace or MLX format):

| Model | Parameters | HuggingFace ID |
|---|---|---|
| Whisper Tiny | 39M | openai/whisper-tiny |
| Whisper Base | 74M | openai/whisper-base |
| Whisper Small | 244M | openai/whisper-small |
| Whisper Medium | 769M | openai/whisper-medium |
| Whisper Large V3 | 1.5B | openai/whisper-large-v3 |
| Whisper Large V3 Turbo | 809M | openai/whisper-large-v3-turbo |

MLX-format weights (e.g. from mlx-community) are also supported.

Qwen3-ASR

| Model | Parameters | HuggingFace ID |
|---|---|---|
| Qwen3-ASR-0.6B | 0.6B | Qwen/Qwen3-ASR-0.6B |

Qwen3-ASR is transcription-only: translation (/v1/audio/translations) is not supported. The model auto-detects the language; the language and prompt parameters are ignored.
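A client can apply these restrictions before sending a request. The sketch below is illustrative only: `prepare_request` is a hypothetical helper, and recognising Qwen3-ASR by the substring `qwen3-asr` in the model ID is an assumption, not something the server does.

```python
ENDPOINTS = {
    "transcribe": "/v1/audio/transcriptions",
    "translate": "/v1/audio/translations",
}


def prepare_request(model: str, task: str, fields: dict) -> tuple[str, dict]:
    """Return (endpoint path, form fields) for task "transcribe" or "translate"."""
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task: {task!r}")
    # Assumption: Qwen3-ASR checkpoints can be recognised from the model ID.
    is_qwen3_asr = "qwen3-asr" in model.lower()
    if is_qwen3_asr and task == "translate":
        raise ValueError("Qwen3-ASR is transcription-only; translation is not supported")
    if is_qwen3_asr:
        # The server ignores these for Qwen3-ASR, so drop them to keep requests honest.
        fields = {k: v for k, v in fields.items() if k not in ("language", "prompt")}
    return ENDPOINTS[task], dict(fields)
```

Failing on the client side for an unsupported translation request gives a clearer error than waiting for the server to reject it.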

API Endpoints

POST /v1/audio/transcriptions

Transcribe audio to text.

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (wav, mp3, m4a, etc.) |
| model | string | "whisper" | Model identifier |
| language | string | null | ISO 639-1 language code (e.g. en, zh) |
| prompt | string | null | Guide transcription (e.g. proper nouns) |
| response_format | string | "json" | json, text, or verbose_json |
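The optional parameters map naturally onto a small builder that applies the documented defaults and leaves out anything that is null. This is a sketch under stated assumptions: `build_form_fields` is a hypothetical helper, and the two-letter check on `language` is a client-side sanity check for ISO 639-1, not the server's own validation.

```python
from typing import Optional

VALID_FORMATS = {"json", "text", "verbose_json"}


def build_form_fields(
    model: str = "whisper",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: str = "json",
) -> dict:
    """Build the non-file form fields for /v1/audio/transcriptions."""
    if response_format not in VALID_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(VALID_FORMATS)}")
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        # ISO 639-1 codes are exactly two letters (e.g. "en", "zh").
        if len(language) != 2 or not language.isalpha():
            raise ValueError(f"expected an ISO 639-1 code, got {language!r}")
        fields["language"] = language.lower()
    if prompt is not None:
        fields["prompt"] = prompt
    return fields
```

For example, `build_form_fields(language="zh", prompt="vLLM, MLX")` yields the four fields `model`, `response_format`, `language`, and `prompt`, ready to be sent alongside the `file` part.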

POST /v1/audio/translations

Translate audio to English text. Accepts the same parameters as /v1/audio/transcriptions, except language.

Response Formats

json (default):

{"text": "Hello, world."}

verbose_json:

{
  "text": "Hello, world.",
  "language": "en",
  "duration": 2.5
}

text: Plain text output.
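Because the three formats return different shapes, client code may want to normalise them. The sketch below assumes the response bodies look exactly like the examples above; `parse_transcription` is a hypothetical helper, and fields a format does not provide are returned as `None`.

```python
import json


def parse_transcription(body: bytes, response_format: str = "json") -> dict:
    """Normalise a response body into a dict with text, language, and duration keys."""
    raw = body.decode("utf-8")
    if response_format == "text":
        # Plain text output carries no metadata.
        return {"text": raw.strip(), "language": None, "duration": None}
    data = json.loads(raw)
    return {
        "text": data["text"],
        "language": data.get("language"),  # only present in verbose_json
        "duration": data.get("duration"),  # only present in verbose_json
    }
```

Callers can then read `result["text"]` regardless of which `response_format` was requested.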