LLM Compressor Docs

Qwen3.5

Quantization examples for the Qwen3.5 family of models, including dense vision-language and sparse MoE variants.

Note: These examples require transformers v5 or later. To upgrade, run:
uv pip install --upgrade transformers
After upgrading, the examples run end-to-end.
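
Before launching a long quantization run, it can help to confirm that the installed transformers release satisfies the requirement above. The following is a minimal sketch using only the Python standard library. The helper names (`major_version`, `meets_minimum`, `installed_transformers_ok`) are hypothetical and not part of LLM Compressor, and the check compares the major version number only.

```python
from importlib import metadata

# Minimum major version taken from the note above (transformers >= v5).
MIN_MAJOR = 5


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '5.0.0rc1'."""
    return int(version.split(".")[0])


def meets_minimum(version: str, min_major: int = MIN_MAJOR) -> bool:
    """True if the given version string has a major version >= min_major."""
    return major_version(version) >= min_major


def installed_transformers_ok() -> bool:
    """Check the installed transformers package; False if it is missing."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    if installed_transformers_ok():
        print("transformers is new enough for these examples")
    else:
        print("upgrade with: uv pip install --upgrade transformers")
```

Comparing only the major version keeps the sketch dependency-free. A stricter check could parse the full version string with a dedicated versioning library.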

Pre-quantized Checkpoints

Model	Format	Hugging Face Link
Qwen3.5-4B	FP8-dynamic	RedHatAI/Qwen3.5-4B-FP8-dynamic
Qwen3.5-4B	W4A16	RedHatAI/Qwen3.5-4B-quantized.w4a16
Qwen3.5-4B	W8A8	RedHatAI/Qwen3.5-4B-quantized.w8a8
Qwen3.5-9B	FP8-dynamic	RedHatAI/Qwen3.5-9B-FP8-dynamic
Qwen3.5-9B	W4A16	RedHatAI/Qwen3.5-9B-quantized.w4a16
Qwen3.5-9B	W8A8	RedHatAI/Qwen3.5-9B-quantized.w8a8
Qwen3.5-35B-A3B	FP8-dynamic	RedHatAI/Qwen3.5-35B-A3B-FP8-dynamic
Qwen3.5-122B-A10B	FP8-dynamic	RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic
Qwen3.5-122B-A10B	NVFP4	RedHatAI/Qwen3.5-122B-A10B-NVFP4
Qwen3.5-397B-A17B	FP8-dynamic	RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic
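
For scripts that pick a checkpoint programmatically, the table above can be expressed as a small lookup. This is only an illustrative sketch: the repository IDs are copied from the table, while the `CHECKPOINTS` mapping and the `find_checkpoint` and `formats_for` helpers are hypothetical names introduced here.

```python
# Repository IDs copied from the table above, keyed by (model, format).
CHECKPOINTS = {
    ("Qwen3.5-4B", "FP8-dynamic"): "RedHatAI/Qwen3.5-4B-FP8-dynamic",
    ("Qwen3.5-4B", "W4A16"): "RedHatAI/Qwen3.5-4B-quantized.w4a16",
    ("Qwen3.5-4B", "W8A8"): "RedHatAI/Qwen3.5-4B-quantized.w8a8",
    ("Qwen3.5-9B", "FP8-dynamic"): "RedHatAI/Qwen3.5-9B-FP8-dynamic",
    ("Qwen3.5-9B", "W4A16"): "RedHatAI/Qwen3.5-9B-quantized.w4a16",
    ("Qwen3.5-9B", "W8A8"): "RedHatAI/Qwen3.5-9B-quantized.w8a8",
    ("Qwen3.5-35B-A3B", "FP8-dynamic"): "RedHatAI/Qwen3.5-35B-A3B-FP8-dynamic",
    ("Qwen3.5-122B-A10B", "FP8-dynamic"): "RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic",
    ("Qwen3.5-122B-A10B", "NVFP4"): "RedHatAI/Qwen3.5-122B-A10B-NVFP4",
    ("Qwen3.5-397B-A17B", "FP8-dynamic"): "RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic",
}


def formats_for(model: str) -> list[str]:
    """Return the sorted list of formats listed for a model."""
    return sorted(fmt for (name, fmt) in CHECKPOINTS if name == model)


def find_checkpoint(model: str, fmt: str) -> str:
    """Return the Hugging Face repo ID for a (model, format) pair."""
    try:
        return CHECKPOINTS[(model, fmt)]
    except KeyError:
        raise KeyError(
            f"No {fmt} checkpoint listed for {model}; "
            f"listed formats: {formats_for(model)}"
        ) from None


if __name__ == "__main__":
    print(find_checkpoint("Qwen3.5-122B-A10B", "NVFP4"))
```

The returned repository ID could then be passed to a serving command such as `vllm serve`, assuming the installed vLLM release supports the Qwen3.5 architecture and the chosen quantization format.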