Compresses model weights to fewer bits — less VRAM, slight quality trade-off.
awq / gptq / squeezellm: ~4-bit (weight memory roughly a quarter of fp16, good quality).
fp8: 8-bit, minimal quality loss.
bitsandbytes: flexible, slightly slower throughput.
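As a rough back-of-envelope for the savings above, weight memory scales with bits per parameter (this sketch ignores KV cache and activations, and the bit counts are nominal):

```python
# Nominal bits per weight for each option (approximate; real formats
# add small overheads for scales and zero-points).
BITS = {"none (fp16)": 16, "fp8": 8, "awq": 4, "gptq": 4, "squeezellm": 4}

def weight_gib(num_params, fmt):
    """Approximate weight memory in GiB for a parameter count and format."""
    return num_params * BITS[fmt] / 8 / 2**30

# A 7B-parameter model: fp16 weights are ~13 GiB, 4-bit ~3.3 GiB.
print(weight_gib(7e9, "none (fp16)"), weight_gib(7e9, "awq"))
```

Actual VRAM use is higher, since the KV cache and activations stay in 16-bit (or fp8) regardless of weight quantization.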
How to check what your model supports:
- Auto-detected above: this page reads the model's config.json and pre-selects the matching option when a quantization config is embedded.
- Open config.json in the model folder and look for a "quantization_config" key; its quant_type or quant_method field names the format.
- The HuggingFace model card / repo name almost always states the format (e.g. "…-AWQ", "…-GPTQ").
- If none of the above are present, the model is not quantized — leave this set to none.
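The config.json check above can be scripted. A minimal sketch (the helper name is illustrative; it assumes the standard Hugging Face config layout):

```python
import json
from pathlib import Path

def detect_quantization(model_dir):
    """Return the quantization format named in config.json, or None."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    qcfg = cfg.get("quantization_config")
    if not qcfg:
        return None  # no key present -> model is not quantized
    # Newer configs name the format in "quant_method" (e.g. "awq", "gptq");
    # some older ones use "quant_type" instead.
    return qcfg.get("quant_method") or qcfg.get("quant_type")
```

If this returns None, leave the setting at none.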
⚠ The model files must already be quantized in this format — vLLM cannot quantize on the fly.