Fp8 nvfp4 example

DeepSeekV4

Quantizing DeepSeekV4

DeepSeekV4 is currently supported experimentally in LLM Compressor. You can quantize the model using the following steps:

Install LLM Compressor from scratch and check out the DeepSeekV4 experimental branch

git clone https://github.com/vllm-project/llm-compressor
cd llm-compressor
git checkout deepseekv4-experimental
uv pip install -e .
uv pip install -U transformers==5.8.0

Run the example. You can replace the MODEL_ID with any of inference-optimization/DSV4-tiny-empty or RedHatAI/DeepSeek-V4-Flash-BF16
```
python3 examples/quantizing_moe/deepseek_v4_example.py
```
Convert the model. This is to account for issues with the DSV4 transformers model definition which may be fixed at a future date
```
python3 fix_dsv4_structure.py DeepSeek-V4-Flash-BF16-NVFP4-FP8-BLOCK DeepSeek-V4-Flash-NVFP4-FP8
```

Install vllm with the experimental branch. You may want to rebase on main

git clone https://github.com/neuralmagic/vllm
cd vllm
git checkout kylesayrs/deepseek-ct
uv pip install -U transformers==5.8.0
# optional:
# git rebase main
VLLM_USE_PRECOMPILED=1 uv pip install -e .

Serve with vLLM

vllm serve DeepSeek-V4-Flash-NVFP4-FP8 --tensor-parallel-size 4 --kv_cache_dtype="fp8"