msModelSlim Quantization

Overview

msModelSlim is an Ascend compression toolkit for producing pre-quantized model checkpoints. In vLLM-Omni, these checkpoints run through the Ascend/NPU path with --quantization ascend.

msModelSlim performs static quantization: the quantized weights are generated offline, before vLLM-Omni inference starts, rather than at load time.
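
To make "offline" concrete, here is a minimal sketch of static weight quantization in general. It uses plain symmetric per-tensor INT8 rounding, which is an assumption chosen for illustration and is not msModelSlim's actual algorithm; the function names are hypothetical.

```python
# Illustrative only: generic symmetric INT8 quantization, NOT msModelSlim's method.
# The point is that integers and scale are computed once, ahead of inference.

def quantize_int8(weights):
    """Return (integer values, scale) for a list of float weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the stored scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)      # done offline, then saved to the checkpoint
restored = dequantize(q, scale)        # what inference effectively works with
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy cost a static recipe accepts in exchange for smaller weights.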

Hardware Support

Device Support
NVIDIA Blackwell GPU (SM 100+)
NVIDIA Ada/Hopper GPU (SM 89+)
NVIDIA Ampere GPU (SM 80+)
AMD ROCm
Intel XPU
Ascend NPU

Legend: supported, unsupported, not verified in this guide.

Model Type Support

Diffusion Model (Qwen-Image, Wan2.2)

| Model | Base model | Scope | Hardware | Notes |
|---|---|---|---|---|
| Wan2.2 | Wan2.2 diffusion weights | DiT or diffusion stage | Ascend NPU | Upstream msModelSlim provides a Wan2.2 quantization recipe; vLLM-Omni inference validation is not listed |
| Qwen-Image | Qwen/Qwen-Image, Qwen/Qwen-Image-2512 | DiT or diffusion stage | Ascend NPU | Not validated in this guide |
| HunyuanImage-3.0 | tencent/HunyuanImage-3.0, tencent/HunyuanImage-3.0-Instruct | DiT or diffusion stage | Ascend A2/A3 NPU | Generate quantized weights with the HunyuanImage-3.0 msModelSlim adaptation |

For HunyuanImage-3.0, public quantized weights are not yet available on Hugging Face. Use the HunyuanImage-3.0 msModelSlim adaptation to generate the checkpoint manually.

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

| Model | Scope | Status | Notes |
|---|---|---|---|
| Qwen3-Omni | Thinker or language-model stage | Not validated | No msModelSlim omni checkpoint path is documented |
| Qwen3-TTS | TTS language-model stage | Not validated | No msModelSlim TTS checkpoint path is documented |

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

| Model | Scope | Status | Notes |
|---|---|---|---|
| BAGEL | Stage-specific diffusion or transformer weights | Not validated | Requires a model-specific Ascend adaptation |
| GLM-Image | Stage-specific diffusion or transformer weights | Not validated | Requires a model-specific Ascend adaptation |

Configuration

Offline inference:

python text_to_image.py --model <quantized-model-path> --quantization ascend

Online serving:

vllm serve <quantized-model-path> --omni --quantization ascend

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| --quantization | str | - | Use ascend for msModelSlim-produced checkpoints |
| model | str | - | Path to the quantized checkpoint generated by Ascend tooling |
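
When the launch is scripted, the two invocations above can be assembled as argument lists. This is a minimal sketch: the helper names and the placeholder path are hypothetical, and only the flags shown in this guide are used.

```python
import shlex

def offline_command(model_path, script="text_to_image.py"):
    """Argument list for the offline inference example in this guide."""
    return ["python", script, "--model", model_path, "--quantization", "ascend"]

def serve_command(model_path):
    """Argument list for the online serving example in this guide."""
    return ["vllm", "serve", model_path, "--omni", "--quantization", "ascend"]

checkpoint = "/path/to/quantized-model"  # placeholder, replace with a real path
print(shlex.join(offline_command(checkpoint)))
print(shlex.join(serve_command(checkpoint)))
```

Passing the resulting list to `subprocess.run(..., check=True)` avoids shell quoting problems when the checkpoint path contains spaces.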

Example msModelSlim command for a Wan2.2 text-to-video (T2V) W8A8 checkpoint; the quantization recipe is selected by the YAML file passed to --config_path:

msmodelslim quant \
  --model_path /path/to/wan2_2_t2v_float_weights \
  --save_path /path/to/wan2_2_t2v_quantized_weights \
  --device npu \
  --model_type Wan2_2 \
  --config_path /path/to/wan2_2_w8a8f8_mxfp_t2v.yaml \
  --trust_remote_code True

For HunyuanImage-3.0, use the Hunyuan-specific adaptation linked above.
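
If the quantization step is driven from a script, the Wan2.2 command above can be built programmatically. The sketch below only mirrors the flags shown in this guide; the function name and its defaults are hypothetical, and no other msmodelslim options are assumed.

```python
import shlex

def msmodelslim_quant_command(model_path, save_path, config_path,
                              model_type="Wan2_2", device="npu",
                              trust_remote_code=True):
    """Build the `msmodelslim quant` argument list using only the flags in this guide."""
    return [
        "msmodelslim", "quant",
        "--model_path", model_path,
        "--save_path", save_path,
        "--device", device,
        "--model_type", model_type,
        "--config_path", config_path,
        "--trust_remote_code", str(trust_remote_code),  # rendered as "True" / "False"
    ]

cmd = msmodelslim_quant_command(
    "/path/to/wan2_2_t2v_float_weights",
    "/path/to/wan2_2_t2v_quantized_weights",
    "/path/to/wan2_2_w8a8f8_mxfp_t2v.yaml",
)
print(shlex.join(cmd))
# To execute on an Ascend host: subprocess.run(cmd, check=True)
```

The path passed as save_path is what is later given to vLLM-Omni as the quantized model path.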

Validation and Notes

  1. Run with the Ascend/NPU installation and environment.
  2. The ascend quantization method expects weights produced by the Ascend tooling; it is not a load-time CUDA quantizer.
  3. Keep the quantized checkpoint aligned with the same model architecture and stage config used for BF16 inference.
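
Note 3 can be partly checked before launch by comparing metadata between the BF16 and quantized checkpoints. The sketch below assumes both directories contain a JSON file named config.json with an architectures key; both the file name and the key are assumptions for illustration, and diffusion checkpoints may store this information elsewhere.

```python
import json
import tempfile
from pathlib import Path

def config_mismatches(bf16_dir, quant_dir, filename="config.json",
                      keys=("architectures",)):
    """Return {key: (bf16_value, quant_value)} for every key that differs."""
    bf16 = json.loads((Path(bf16_dir) / filename).read_text())
    quant = json.loads((Path(quant_dir) / filename).read_text())
    return {k: (bf16.get(k), quant.get(k))
            for k in keys if bf16.get(k) != quant.get(k)}

# Demo with throwaway directories standing in for real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    bf16_dir = Path(tmp) / "bf16"
    quant_dir = Path(tmp) / "quant"
    for directory, arch in ((bf16_dir, "ModelA"), (quant_dir, "ModelB")):
        directory.mkdir()
        (directory / "config.json").write_text(json.dumps({"architectures": [arch]}))
    mismatches = config_mismatches(bf16_dir, quant_dir)

print(mismatches)
```

An empty result means the compared keys agree; it does not prove the stage config matches, so it complements rather than replaces a real inference run.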