QLoRA, published by Dettmers et al. in 2023, combines LoRA with 4-bit quantization to push the GPU memory cost of fine-tuning down even further. It keeps the frozen base model in 4-bit precision while training the small LoRA adapters in higher (16-bit) precision, which made it possible to fine-tune a 65-billion-parameter LLaMA model on a single 48 GB GPU. The paper introduced the NF4 (4-bit NormalFloat) data type and a 'double quantization' trick (quantizing the quantization constants themselves), both of which have since shaped much PEFT research. Most community fine-tunes today still rely on QLoRA; it is what enables many 'I trained my own model at home' stories.
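Since QLoRA is now packaged in the Hugging Face ecosystem, a minimal sketch of a typical setup looks like the following, using `transformers`' `BitsAndBytesConfig` for the 4-bit NF4 base model and `peft` for the LoRA adapters. The model name and LoRA hyperparameters here are illustrative assumptions, not values taken from the paper.

```python
# Minimal QLoRA-style setup sketch using Hugging Face transformers + peft.
# Assumptions: the model id and LoRA hyperparameters are illustrative, not
# the exact values used by Dettmers et al. (2023).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works

# 4-bit NF4 quantization with double quantization, as introduced by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Cast norms/embeddings to a stable dtype and enable gradient-checkpointing hooks.
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the frozen base stays in 4-bit.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],     # attention projections in LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

From here the model can be passed to a standard training loop or `Trainer`; only the adapter weights receive gradients, which is where the memory savings come from.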
External Links