msModelSlim Quantization
Overview
msModelSlim is an Ascend compression toolkit for producing pre-quantized model checkpoints. In vLLM-Omni, these checkpoints run through the Ascend/NPU path with --quantization ascend.
msModelSlim performs static quantization: the quantized weights are generated offline, before vLLM-Omni inference starts.
Hardware Support
| Device | Support |
|---|---|
| NVIDIA Blackwell GPU (SM 100+) | ❌ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ❌ |
| NVIDIA Ampere GPU (SM 80+) | ❌ |
| AMD ROCm | ❌ |
| Intel XPU | ❌ |
| Ascend NPU | ✅ |
Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.
Model Type Support
Diffusion Model (Qwen-Image, Wan2.2)
| Model | Base model | Scope | Hardware | Notes |
|---|---|---|---|---|
| Wan2.2 | Wan2.2 diffusion weights | DiT or diffusion stage | Ascend NPU | Upstream msModelSlim provides a Wan2.2 quantization recipe; vLLM-Omni inference validation is not listed |
| Qwen-Image | Qwen/Qwen-Image, Qwen/Qwen-Image-2512 | DiT or diffusion stage | Ascend NPU | Not validated in this guide |
| HunyuanImage-3.0 | tencent/HunyuanImage-3.0, tencent/HunyuanImage-3.0-Instruct | DiT or diffusion stage | Ascend A2/A3 NPU | Generate quantized weights with the HunyuanImage-3.0 msModelSlim adaptation |
Pre-quantized weights are not yet published on Hugging Face. For HunyuanImage-3.0, generate the checkpoint manually with the HunyuanImage-3.0 msModelSlim adaptation.
Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)
| Model | Scope | Status | Notes |
|---|---|---|---|
| Qwen3-Omni | Thinker or language-model stage | Not validated | No msModelSlim omni checkpoint path is documented |
| Qwen3-TTS | TTS language-model stage | Not validated | No msModelSlim TTS checkpoint path is documented |
Multi-Stage Diffusion Model (BAGEL, GLM-Image)
| Model | Scope | Status | Notes |
|---|---|---|---|
| BAGEL | Stage-specific diffusion or transformer weights | Not validated | Requires a model-specific Ascend adaptation |
| GLM-Image | Stage-specific diffusion or transformer weights | Not validated | Requires a model-specific Ascend adaptation |
Configuration
Offline inference:
Online serving:
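A minimal sketch of a serving invocation, assuming the vLLM-Omni server is launched with vllm serve and the --omni flag (an assumption, not confirmed by this guide) and using a placeholder checkpoint path. Only --quantization ascend and the checkpoint path come from the parameters documented here. The script prints the command rather than running it, so it can be inspected first and then executed on an Ascend NPU host.

```shell
# Hypothetical placeholder; point this at the directory written by msModelSlim.
QUANT_MODEL_DIR="${QUANT_MODEL_DIR:-/path/to/quantized_checkpoint}"

# Assumption: `vllm serve <model> --omni` starts the vLLM-Omni server.
# `--quantization ascend` selects the msModelSlim checkpoint path.
SERVE_CMD="vllm serve ${QUANT_MODEL_DIR} --omni --quantization ascend"

# Dry run: print the command instead of launching the server.
echo "${SERVE_CMD}"
```

Set QUANT_MODEL_DIR in the environment to override the placeholder before printing the command.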
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --quantization | str | - | Use ascend for msModelSlim-produced checkpoints |
| model | str | - | Path to the quantized checkpoint generated by Ascend tooling |
Example msModelSlim command for a Wan2.2 W8A8 checkpoint:
msmodelslim quant \
--model_path /path/to/wan2_2_t2v_float_weights \
--save_path /path/to/wan2_2_t2v_quantized_weights \
--device npu \
--model_type Wan2_2 \
--config_path /path/to/wan2_2_w8a8f8_mxfp_t2v.yaml \
--trust_remote_code True
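Because quantization runs offline and can take a while, it can help to check the input paths before launching it. The sketch below is a hedged illustration: build_quant_cmd and check_inputs are hypothetical helper names, the paths are placeholders, and the flags are copied unchanged from the Wan2.2 command above. It prints the command instead of executing it.

```shell
# Hypothetical helper: assemble the msmodelslim command shown above.
# Flags are copied from the example; only the three paths are parameters.
build_quant_cmd() {
  printf 'msmodelslim quant --model_path %s --save_path %s --device npu --model_type Wan2_2 --config_path %s --trust_remote_code True\n' \
    "$1" "$2" "$3"
}

# Hypothetical helper: fail early if the float weights directory or the
# YAML recipe file is missing.
check_inputs() {
  [ -d "$1" ] || { echo "missing model directory: $1" >&2; return 1; }
  [ -f "$2" ] || { echo "missing config file: $2" >&2; return 1; }
}

# Dry run with the placeholder paths from the example command.
build_quant_cmd \
  /path/to/wan2_2_t2v_float_weights \
  /path/to/wan2_2_t2v_quantized_weights \
  /path/to/wan2_2_w8a8f8_mxfp_t2v.yaml
```

In a real run, call check_inputs with the float weights directory and the recipe file first, and execute the printed command only if it succeeds.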
For HunyuanImage-3.0, use the Hunyuan-specific adaptation linked above.
Validation and Notes
- Run with the Ascend/NPU installation and environment.
- The ascend quantization method expects weights produced by the Ascend tooling; it is not a load-time CUDA quantizer.
- Keep the quantized checkpoint aligned with the same model architecture and stage config used for BF16 inference.