GLM-ASR Usage Guide
This guide describes how to run GLM-ASR-Nano-2512 for automatic speech recognition.
Model Introduction
GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters (2B model size). Designed for real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
Key Capabilities
- Exceptional Dialect Support: Beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.
- Low-Volume Speech Robustness: Specifically trained for "Whisper/Quiet Speech" scenarios. It captures and accurately transcribes extremely low-volume audio that traditional models often miss.
- SOTA Performance: Achieves the lowest average error rate (4.10) among comparable open-source models, showing significant advantages in Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
Installing Dependencies
uv venv
source .venv/bin/activate
# Install transformers from source (required)
uv pip install git+https://github.com/huggingface/transformers.git
uv pip install -U "vllm[audio]" --torch-backend auto # vllm>=0.14.1 is required
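After installation, it can help to confirm that the environment meets the minimums this guide states (vllm >= 0.14.1 and transformers >= 5.0.0). The following is a minimal sketch using only the standard library. The helper names `release_tuple` and `check_versions` are hypothetical, and the comparison looks only at the leading numeric part of a version string, so a source build such as `5.0.0.dev0` is treated as `5.0.0`, which is looser than a strict PEP 440 comparison.

```python
import re
from importlib import metadata

# Minimum versions taken from this guide; adjust if the requirements change.
MINIMUMS = {"vllm": "0.14.1", "transformers": "5.0.0"}


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release, padded to three parts.

    Example: '5.0.0.dev0' -> (5, 0, 0). Suffixes such as '.dev0' are ignored,
    so this is an approximation rather than a full PEP 440 comparison.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    parts = [int(p) for p in match.group(0).split(".")] if match else []
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_versions(minimums=MINIMUMS) -> dict:
    """Map each package name to (installed_version_or_None, meets_minimum)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        ok = release_tuple(installed) >= release_tuple(minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions().items():
        status = "ok" if ok else "too old or missing"
        print(f"{name}: {installed or 'not installed'} ({status})")
```

Run it inside the activated virtual environment so that it inspects the packages installed above.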
Running with vLLM
Start Server
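The launch command is not shown in this section. With vLLM's standard CLI the server is normally started with `vllm serve <model>`, so the sketch below assumes `vllm serve zai-org/GLM-ASR-Nano-2512` on the default port 8000, which matches the `base_url` used by the client examples. The helper names `build_serve_command` and `wait_until_ready` are hypothetical. The readiness check polls the OpenAI-compatible `/v1/models` route until it answers.

```python
import time
import urllib.error
import urllib.request

MODEL = "zai-org/GLM-ASR-Nano-2512"


def build_serve_command(model: str = MODEL, port: int = 8000) -> list:
    """Assumed launch command; pass the list to subprocess.Popen to start it."""
    return ["vllm", "serve", model, "--port", str(port)]


def server_is_up(base_url: str) -> bool:
    """Return True once GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset: the server is still loading or not running.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000/v1",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    probe=server_is_up,
    sleep=time.sleep,
) -> bool:
    """Poll the server until it responds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A typical use would be `subprocess.Popen(build_serve_command())` followed by `wait_until_ready()` before sending the first request, since model loading can take a while on first start.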
Client Usage
Using OpenAI SDK
import base64
import httpx
from openai import OpenAI

# Initialize client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Download the audio file and encode it to base64.
# The GitHub raw URL redirects, and httpx does not follow redirects by default.
audio_url = "https://github.com/zai-org/GLM-ASR/raw/main/examples/example_en.wav"
audio_bytes = httpx.get(audio_url, follow_redirects=True).content
audio_data = base64.b64encode(audio_bytes).decode("utf-8")

# Create transcription request
response = client.chat.completions.create(
    model="zai-org/GLM-ASR-Nano-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)
print(response.choices[0].message.content)
Using cURL
# First encode the audio to base64; strip line wraps so the JSON string stays valid
AUDIO_BASE64=$(curl -sL "https://github.com/zai-org/GLM-ASR/raw/main/examples/example_en.wav" | base64 | tr -d '\n')
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-ASR-Nano-2512",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "'"${AUDIO_BASE64}"'",
              "format": "wav"
            }
          }
        ]
      }
    ],
    "max_tokens": 500
  }'
Using Local Audio File
import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Load local audio file
with open("your_audio.mp3", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="zai-org/GLM-ASR-Nano-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "mp3"  # or "wav", "flac", etc.
                    }
                }
            ]
        }
    ],
    max_tokens=500
)
print(response.choices[0].message.content)
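Because the chat request repeats the same `input_audio` structure for every file, the payload can be built by a small helper that reads the file and takes the `format` value from the file extension. This is a sketch only: the names `audio_content_part`, `transcription_messages`, and `KNOWN_FORMATS` are hypothetical, and the allow-list simply mirrors the formats mentioned in the example above rather than what the server actually accepts.

```python
import base64
from pathlib import Path

# Hypothetical allow-list for this sketch; the formats the server really
# accepts depend on the audio backend installed with vllm[audio].
KNOWN_FORMATS = {"wav", "mp3", "flac"}


def audio_content_part(path) -> dict:
    """Build one "input_audio" content part from a local audio file."""
    path = Path(path)
    fmt = path.suffix.lower().lstrip(".")
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unsupported audio extension: {path.suffix!r}")
    data = base64.b64encode(path.read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}


def transcription_messages(path) -> list:
    """Wrap the audio part in a single user message for chat.completions."""
    return [{"role": "user", "content": [audio_content_part(path)]}]
```

The result can be passed directly as `messages=transcription_messages("your_audio.mp3")` in the `client.chat.completions.create(...)` call.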
Using Transcribe Endpoint
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Transcribe audio from URL.
# The GitHub raw URL redirects, and httpx does not follow redirects by default.
audio_url = "https://github.com/zai-org/GLM-ASR/raw/main/examples/example_en.wav"
audio_file = httpx.get(audio_url, follow_redirects=True).content

response = client.audio.transcriptions.create(
    model="zai-org/GLM-ASR-Nano-2512",
    file=("audio.wav", audio_file),
)
print(response.text)
Transcribe with cURL
curl http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer EMPTY" \
  -F "model=zai-org/GLM-ASR-Nano-2512" \
  -F "file=@your_audio.wav"
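To transcribe several local files through the same endpoint, the SDK call shown earlier can be wrapped in a loop. The sketch below assumes a client object that exposes `audio.transcriptions.create` the way the OpenAI SDK client does. The function name `transcribe_files` is hypothetical, and files are sent one at a time with no retry or error handling.

```python
from pathlib import Path

MODEL = "zai-org/GLM-ASR-Nano-2512"


def transcribe_files(client, paths, model: str = MODEL) -> dict:
    """Send each file to the transcription endpoint; return {path: text}."""
    results = {}
    for path in map(Path, paths):
        response = client.audio.transcriptions.create(
            model=model,
            # (filename, bytes) tuple, the same shape as in the URL example
            file=(path.name, path.read_bytes()),
        )
        results[str(path)] = response.text
    return results
```

With the client from the examples above, `transcribe_files(client, ["a.wav", "b.wav"])` would return a dictionary mapping each path to its transcript.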
Notes
- Transformers Version: This model requires transformers >= 5.0.0 for optimal compatibility.