msModelSlim Quantization

Overview

msModelSlim is an Ascend compression toolkit for producing pre-quantized model checkpoints. In vLLM-Omni, these checkpoints run through the Ascend/NPU path and are selected with `--quantization ascend`.

msModelSlim performs static quantization: the quantized weights are generated offline, before vLLM-Omni inference starts, rather than at model load time.
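To make "offline" concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization. It is a generic illustration only and is not msModelSlim's algorithm or API; the function names `quantize_int8` and `dequantize` are hypothetical. The point is that the integer values and the scale are computed once, ahead of time, and only those are stored in the checkpoint.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers and scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    q, scale = quantize_int8([0.5, -1.0, 0.25])
    print(q, scale, dequantize(q, scale))
```

The round-trip error per weight is bounded by half the scale, which is why the choice of scale (and, in real toolkits, calibration data) matters.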

Hardware Support

Device	Support
NVIDIA Blackwell GPU (SM 100+)	❌
NVIDIA Ada/Hopper GPU (SM 89+)	❌
NVIDIA Ampere GPU (SM 80+)	❌
AMD ROCm	❌
Intel XPU	❌
Ascend NPU	✅

Legend: ✅ supported, ❌ unsupported, ⭕ not verified in this guide.
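Because only Ascend NPU is supported, a launcher script may want to fail early on other hardware. The sketch below assumes that the presence of the Ascend PyTorch adapter package `torch_npu` is a reasonable signal of an NPU-capable environment; this guide does not state how vLLM-Omni itself detects the device, and the helper name `ascend_runtime_available` is hypothetical.

```python
import importlib.util


def ascend_runtime_available() -> bool:
    """True if the torch_npu package can be located (it is not imported)."""
    # Assumption: torch_npu being installed indicates an Ascend/NPU environment.
    return importlib.util.find_spec("torch_npu") is not None


if __name__ == "__main__":
    if not ascend_runtime_available():
        print("torch_npu not found; --quantization ascend is unlikely to work here")
```

This only checks that the package is installed; it does not confirm that an NPU device is actually visible.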

Model Type Support

Diffusion Model (Qwen-Image, Wan2.2, HunyuanImage-3.0)

Model	Base model	Scope	Hardware	Notes
Wan2.2	Wan2.2 diffusion weights	DiT or diffusion stage	Ascend NPU	Upstream msModelSlim provides a Wan2.2 quantization recipe; vLLM-Omni inference is not validated in this guide
Qwen-Image	`Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512`	DiT or diffusion stage	Ascend NPU	Not validated in this guide
HunyuanImage-3.0	`tencent/HunyuanImage-3.0`, `tencent/HunyuanImage-3.0-Instruct`	DiT or diffusion stage	Ascend A2/A3 NPU	Generate quantized weights with the HunyuanImage-3.0 msModelSlim adaptation

Public quantized weights for HunyuanImage-3.0 are not yet available on Hugging Face. Use the HunyuanImage-3.0 msModelSlim adaptation to generate the checkpoint manually.

Multi-Stage Omni/TTS Model (Qwen3-Omni, Qwen3-TTS)

Model	Scope	Status	Notes
Qwen3-Omni	Thinker or language-model stage	Not validated	No msModelSlim omni checkpoint path is documented
Qwen3-TTS	TTS language-model stage	Not validated	No msModelSlim TTS checkpoint path is documented

Multi-Stage Diffusion Model (BAGEL, GLM-Image)

Model	Scope	Status	Notes
BAGEL	Stage-specific diffusion or transformer weights	Not validated	Requires a model-specific Ascend adaptation
GLM-Image	Stage-specific diffusion or transformer weights	Not validated	Requires a model-specific Ascend adaptation

Configuration

Offline inference:

python text_to_image.py --model <quantized-model-path> --quantization ascend

Online serving:

vllm serve <quantized-model-path> --omni --quantization ascend
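The two invocations above differ only in their entry point; both pass the same checkpoint path and `--quantization ascend`. A minimal sketch of assembling them as argument lists, for example to hand to `subprocess.run`, could look like this. The helper names `offline_cmd` and `serve_cmd` are hypothetical, and only the flags shown above are used.

```python
import shlex


def offline_cmd(model_path: str, script: str = "text_to_image.py") -> list[str]:
    """Argument list for the offline example script shown above."""
    return ["python", script, "--model", model_path, "--quantization", "ascend"]


def serve_cmd(model_path: str) -> list[str]:
    """Argument list for online serving with the ascend quantization method."""
    return ["vllm", "serve", model_path, "--omni", "--quantization", "ascend"]


if __name__ == "__main__":
    path = "/path/to/quantized-model"  # placeholder, not a real checkpoint
    print(shlex.join(offline_cmd(path)))
    print(shlex.join(serve_cmd(path)))
```

Building the command as a list rather than a single string avoids shell-quoting problems when the checkpoint path contains spaces.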

Parameters

Parameter	Type	Default	Description
`--quantization`	str	-	Use `ascend` for msModelSlim-produced checkpoints
`model`	str	-	Path to the quantized checkpoint generated by Ascend tooling; passed as `--model` to the offline script and as the positional argument to `vllm serve`

Example msModelSlim command for a Wan2.2 W8A8 checkpoint:

msmodelslim quant \
  --model_path /path/to/wan2_2_t2v_float_weights \
  --save_path /path/to/wan2_2_t2v_quantized_weights \
  --device npu \
  --model_type Wan2_2 \
  --config_path /path/to/wan2_2_w8a8f8_mxfp_t2v.yaml \
  --trust_remote_code True

For HunyuanImage-3.0, use the Hunyuan-specific adaptation linked above.
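When the Wan2.2 command above is run from a script, it can help to check the input paths before starting a long quantization job. The following is a sketch under stated assumptions: it reuses only the flags from the example command, the helper names `msmodelslim_quant_cmd` and `missing_inputs` are hypothetical, and the default values simply mirror the example.

```python
from pathlib import Path


def msmodelslim_quant_cmd(model_path, save_path, config_path,
                          model_type="Wan2_2", device="npu"):
    """Build the `msmodelslim quant` argument list from the example above."""
    return [
        "msmodelslim", "quant",
        "--model_path", str(model_path),
        "--save_path", str(save_path),
        "--device", device,
        "--model_type", model_type,
        "--config_path", str(config_path),
        "--trust_remote_code", "True",
    ]


def missing_inputs(model_path, config_path):
    """Return the input paths that do not exist (save_path is an output, so it is not checked)."""
    return [str(p) for p in (Path(model_path), Path(config_path)) if not p.exists()]
```

A caller would typically abort if `missing_inputs(...)` is non-empty and otherwise pass the list from `msmodelslim_quant_cmd(...)` to `subprocess.run(..., check=True)`.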

Validation and Notes

Run the commands above in an Ascend/NPU installation and environment.
The `ascend` quantization method expects weights already produced by the Ascend tooling; it does not quantize at load time and is not a CUDA quantizer.
The quantized checkpoint must match the model architecture and stage config used for BF16 inference.
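One way to check that a quantized checkpoint still matches its BF16 counterpart is to compare their configuration dictionaries and report any differing keys. The sketch below is an assumption-laden illustration: the function name `config_mismatches`, the ignored key `quantization_config`, and the sample keys are all hypothetical, and this guide does not specify which files or fields msModelSlim writes.

```python
def config_mismatches(bf16_cfg: dict, quant_cfg: dict,
                      ignore=("quantization_config",)):
    """Return {key: (bf16_value, quant_value)} for keys whose values differ.

    Keys listed in `ignore` are skipped; a key absent on one side is
    reported with None for that side.
    """
    missing = object()  # sentinel so an explicit None value is not confused with absence
    keys = (set(bf16_cfg) | set(quant_cfg)) - set(ignore)
    out = {}
    for key in sorted(keys):
        a = bf16_cfg.get(key, missing)
        b = quant_cfg.get(key, missing)
        if a != b:
            out[key] = (None if a is missing else a, None if b is missing else b)
    return out


if __name__ == "__main__":
    bf16 = {"num_layers": 40, "hidden_size": 3072}           # hypothetical values
    quant = {"num_layers": 40, "hidden_size": 3072,
             "quantization_config": {"method": "example"}}   # hypothetical values
    print(config_mismatches(bf16, quant))
```

An empty result means the two configurations agree on every compared key; any non-empty result is worth inspecting before launching inference.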