Qwen3-ASR-1.7B#
Introduction#
Qwen3-ASR-1.7B is a lightweight, high-performance automatic speech recognition (ASR) model developed by the Qwen Team. It delivers industry-leading recognition accuracy across multi-scene Chinese/English speech, Chinese dialects, multilingual speech, and singing voice, with native support for long audio and streaming inference and deep optimization for Ascend NPU hardware.
This document covers the main verification steps for the model: supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.
Environment Preparation#
Model Weight#
Qwen3-ASR-1.7B (BF16 version): requires 1 Ascend 910B (with 1 x 64G NPU). Download model weight
It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/.
Installation#
Qwen3-ASR-1.7B is supported in vllm-ascend.
You can use our official docker image to run Qwen3-ASR-1.7B directly.
export IMAGE=quay.io/ascend/vllm-ascend:v0.20.2rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data/vllm-workspace/models:/data/vllm-workspace/models \
-p 8000:8000 \
-it $IMAGE bash
If you prefer not to use the Docker image above, you can build everything from source:
Install vllm-ascend from source; refer to installation.
Deployment#
vllm serve "Qwen/Qwen3-ASR-1.7B" \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--port 8000
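Loading the model can take a while, so it may help to wait for the server before sending requests. The following is a minimal sketch, assuming the server listens on http://localhost:8000 as configured above and exposes the OpenAI-compatible /v1/models route. The helper names `is_ready` and `wait_until_ready` are hypothetical and not part of vllm-ascend.

```python
import time
import urllib.request


def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts all derive from OSError.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",  # assumption: matches --port 8000 above
    attempts: int = 60,
    delay: float = 5.0,
    probe=is_ready,
    sleep=time.sleep,
) -> bool:
    """Poll the server up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready()` returns `True` as soon as the server responds and `False` if it never does within the allotted attempts. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.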
Functional Verification#
Once the server is running, you can send it an audio transcription request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "audio_url",
         "audio_url":
           {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
      ]}
    ]
  }'
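The same request can also be issued from Python, which is convenient when scripting an evaluation. This is a minimal sketch using only the standard library. It assumes the server address from the deployment step and the standard OpenAI-style response layout (`choices[0].message.content`). The function names `build_payload` and `transcribe` are hypothetical, and the exact format of the returned text is not specified in this document.

```python
import json
import urllib.request

# Assumed values, copied from the commands in this document.
BASE_URL = "http://localhost:8000"
AUDIO_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"


def build_payload(audio_url: str) -> dict:
    """Build the same chat-completions body as the curl example."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ]
    }


def transcribe(audio_url: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """POST the request and return the content of the first choice."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(audio_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `transcribe(AUDIO_URL)` should return the model's output for the sample file.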
Accuracy Evaluation#
After all samples were processed, transcription quality was measured using:
WER (Word Error Rate) for word-level recognition accuracy
CER (Character Error Rate) for character-level recognition accuracy
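Both metrics are commonly defined as the edit distance between reference and hypothesis divided by the reference length, counted in words for WER and in characters for CER. The sketch below illustrates that definition. It is not necessarily the script used to produce the results in this document, and it makes its own assumptions: text is split on whitespace without further normalization, and CER ignores spaces.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with whitespace removed (an assumption of this sketch)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For a whole dataset, the usual practice is to sum the edit counts and the reference lengths over all samples before dividing, rather than averaging per-sample rates. In practice the text is usually also normalized (case, punctuation) before scoring.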
The current evaluation results are:
| Category | Dataset | Metric | Result |
|---|---|---|---|
| Accuracy | librispeech_asr / clean / test | Total Samples | 500 |
| Accuracy | librispeech_asr / clean / test | Success | 500 |
| Accuracy | librispeech_asr / clean / test | Failure | 0 |
| Accuracy | librispeech_asr / clean / test | WER | 0.035 |
Performance#
Baseline Result#
In the current evaluation, Qwen3-ASR-1.7B processed 100 samples in approximately 57 seconds, achieving an average throughput of 1.73 samples/s under the current online serving setup.
| Category | Dataset | Metric | Result |
|---|---|---|---|
| Performance | LibriSpeech test/clean (100 samples) | Total Samples | 100 |
| Performance | LibriSpeech test/clean (100 samples) | Total Runtime | 57 s |
| Performance | LibriSpeech test/clean (100 samples) | Average Throughput | 1.73 samples/s |
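Average throughput in the table above is the number of processed samples divided by wall-clock runtime. The sketch below shows one way to take such a measurement around an arbitrary per-sample function. The name `measure_throughput` and its parameters are hypothetical, and it processes samples sequentially, which may differ from how the reported numbers were obtained.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_throughput(
    process: Callable[[str], object],
    samples: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float, float]:
    """Run `process` on every sample and return (count, elapsed_seconds, samples_per_second)."""
    start = clock()
    count = 0
    for sample in samples:
        process(sample)  # e.g. one transcription request per audio file
        count += 1
    elapsed = clock() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate
```

Note that 100 samples in exactly 57 s would give about 1.75 samples/s, while 1.73 samples/s corresponds to roughly 57.8 s, so the runtime figure is presumably rounded.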
Remarks#
This result reflects end-to-end serving performance, including audio preprocessing, request construction, API communication, inference, and response parsing. Actual performance may vary depending on hardware, concurrency, audio length, and deployment configuration.
Further benchmarking is recommended for latency distribution, concurrent throughput, long-audio scenarios, and system resource utilization.
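As a starting point for the latency and concurrency measurements suggested above, the following sketch times each call under a thread pool and summarizes the results with nearest-rank percentiles. All names here (`percentile`, `timed_call`, `collect_latencies`) are hypothetical helpers, `task` stands for whatever function sends one request, and a thread pool is only one of several ways to generate concurrent load.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def timed_call(task: Callable[[object], object], item: object) -> float:
    """Return the wall-clock seconds spent in task(item)."""
    start = time.perf_counter()
    task(item)
    return time.perf_counter() - start


def collect_latencies(
    task: Callable[[object], object],
    items: Iterable[object],
    workers: int = 4,
) -> List[float]:
    """Run task over items with `workers` concurrent threads and return per-call latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: timed_call(task, item), items))
```

Reporting `percentile(latencies, 50)`, `percentile(latencies, 90)`, and `percentile(latencies, 99)` for several values of `workers` gives a first picture of how latency changes with concurrency.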