@jbhuang0604: LoRA, low-rank adaptation, is arguably the most popular parameter-efficient fine-tuning method for LLMs. But how does i…
Summary
LoRA (low-rank adaptation) is arguably the most popular parameter-efficient fine-tuning method for LLMs. This video introduces how LoRA and its variants (LoRA+, QLoRA, VeRA, DoRA) work.
Cached at: 06/27/26, 03:51 AM
LoRA, low-rank adaptation, is arguably the most popular parameter-efficient fine-tuning method for LLMs.
But how does it actually work?
Check out the video to learn LoRA and friends (LoRA+, QLoRA, VeRA, and DoRA)!
https://t.co/Bt9ky4oKnk https://t.co/9nUhaG5a8q
TL;DR
LoRA dramatically reduces trainable parameters via low-rank decomposition, making large model fine-tuning efficient. Subsequent methods like QLoRA, VeRA, and DoRA further reduce memory requirements or improve adaptation flexibility.
The Necessity of Parameter-Efficient Fine-Tuning
Modern AI models are extremely large. Full fine-tuning (updating all weights) is costly: a 70 billion parameter model at 16-bit precision requires about 140 GB of memory just for the weights. If we save a full copy for each task, storage costs quickly become unmanageable. Therefore, we need parameter-efficient adaptation methods.
Core Idea of LoRA
In Transformers, most computation occurs in linear layers (Q/K/V projections in attention, expansion and compression in feedforward networks). These layers perform the same operation: ( Y = W \cdot X ).
Instead of fine-tuning the entire weight matrix ( W ) directly, we freeze the original pre-trained weights ( W_0 ) and add a trainable correction matrix ( \Delta W ). After fine-tuning, ( W_0 + \Delta W ) can be merged into a single matrix, making inference speed identical to the original model.
Low-Rank Decomposition
Key assumption: The correction matrix ( \Delta W ) has a low-rank structure. For example, the column vectors of a 6×6 matrix may be redundant, with only two independent directions (rank=2). Thus, we can decompose ( \Delta W ) into two smaller matrices ( B ) (tall and skinny) and ( A ) (short and wide), i.e., ( \Delta W = B A ). This is low-rank adaptation (LoRA). For a layer with input dimension ( d_{in} ) and output dimension ( d_{out} ), a dense update needs ( d_{out} \cdot d_{in} ) parameters, whereas ( B ) and ( A ) together need only ( r (d_{in} + d_{out}) ). By choosing a rank ( r ) much smaller than the input/output dimensions, the number of trainable parameters is greatly reduced.
In practice, LoRA controls the update strength with a scaling factor ( \alpha / r ), keeping the update size stable across different ranks.
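The forward pass, the ( \alpha / r ) scaling, and the merge step can be sketched in NumPy as follows. The function names, toy dimensions, and value of alpha are assumptions chosen only for illustration; B is random here (not zero) so that the equivalence check is non-trivial.

```python
import numpy as np

def lora_forward(W0, A, B, X, alpha):
    """Frozen path plus scaled low-rank correction: Y = W0 X + (alpha / r) B (A X)."""
    r = A.shape[0]
    return W0 @ X + (alpha / r) * (B @ (A @ X))

def merge_lora(W0, A, B, alpha):
    """Fold the adapter into one matrix so inference needs a single matmul."""
    r = A.shape[0]
    return W0 + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4.0    # toy sizes (assumed, not from the video)
W0 = rng.normal(size=(d_out, d_in))     # frozen pre-trained weights
A = rng.normal(size=(r, d_in))          # short and wide
B = rng.normal(size=(d_out, r))         # tall and skinny
X = rng.normal(size=(d_in, 3))          # three input columns

Y_adapter = lora_forward(W0, A, B, X, alpha)      # adapter kept separate
Y_merged = merge_lora(W0, A, B, alpha) @ X        # adapter merged into W0

full_params = d_out * d_in              # parameters of a dense update
lora_params = r * (d_in + d_out)        # parameters of B and A together
```

Both paths give the same output up to floating-point error, which is why a merged LoRA model runs at the same speed as the original.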
Key-Value Interpretation
Decompose ( B ) into column vectors ( \mathbf{b}_i ) and ( A ) into row vectors ( \mathbf{a}_i ), then: [ \Delta Y = B A X = \sum_{i=1}^{r} \mathbf{b}_i (\mathbf{a}_i \cdot X) ] Here, ( \mathbf{a}_i ) act like feature detectors (keys), and ( \mathbf{b}_i ) specify the corresponding corrections (values). For example, when fine-tuning on medical literature, one key might detect heart disease, and the corresponding value adds cardiology expertise. LoRA acts as a local associative memory: only task-relevant inputs activate corrections, while the base model retains general knowledge.
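The key-value reading can be checked numerically for a single input column. This is a minimal sketch with assumed toy sizes: each row of A produces a scalar score, and each score weights the matching column of B.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 5, 7, 3                # toy sizes (assumed)
A = rng.normal(size=(r, d_in))          # rows a_i act as keys
B = rng.normal(size=(d_out, r))         # columns b_i act as values
x = rng.normal(size=d_in)               # one input column

# Matrix form: delta_y = B A x
delta_y_matrix = B @ (A @ x)

# Key-value form: score each key against x, then mix the values by those scores
key_scores = A @ x                      # key_scores[i] = a_i . x
delta_y_sum = sum(key_scores[i] * B[:, i] for i in range(r))
```

A key whose score is near zero contributes almost nothing, which matches the picture of corrections firing only on task-relevant inputs.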
Initialization Strategy
Initially, we want ( \Delta Y = 0 ) to avoid random perturbations disrupting pre-trained knowledge. Therefore, one matrix starts at zero and the other is random. A common choice: randomly initialize ( A ) and zero-initialize ( B ). With ( B = 0 ), the gradient with respect to ( A ) vanishes at the very first step, but ( B ) receives a nonzero gradient immediately; once ( B ) moves away from zero, ( A ) starts learning as well. Random ( A ) provides diverse initial feature directions, benefiting early learning.
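The effect of this initialization on the first gradient step can be sketched with a hand-derived gradient. The squared-error loss and the random targets T are assumptions made only so the gradients have a closed form; they are not part of the video.

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r, n = 4, 6, 2, 5          # toy sizes (assumed)
W0 = rng.normal(size=(d_out, d_in))     # frozen pre-trained weights
X = rng.normal(size=(d_in, n))          # inputs
T = rng.normal(size=(d_out, n))         # hypothetical regression targets

A = rng.normal(size=(r, d_in))          # random initialization
B = np.zeros((d_out, r))                # zero initialization

delta_Y = B @ (A @ X)                   # adapter output at initialization
Y = W0 @ X + delta_Y

# For L = 0.5 * ||Y - T||^2 with delta W = B A:
G = (Y - T) @ X.T                       # dL / d(delta W)
grad_B = G @ A.T                        # dL / dB
grad_A = B.T @ G                        # dL / dA
```

Under these assumptions the adapter output is exactly zero, grad_A is zero, and grad_B is not, so B moves first and A follows on later steps.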
LoRA+: Different Learning Rates
Matrix ( A ) maps the wide input into a low-rank space, and matrix ( B ) maps the low-rank features back to the output space. Their scales differ; using the same learning rate would cause ( B ) to be under-trained. LoRA+ assigns a larger learning rate to ( B ) and a smaller one to ( A ), accelerating convergence and improving final performance.
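The idea of separate learning rates can be sketched as a plain gradient step. The function name, the use of vanilla SGD, and the ratio of 16 are assumptions for illustration; in practice the ratio is a hyperparameter to tune.

```python
import numpy as np

def lora_plus_sgd_step(A, B, grad_A, grad_B, lr_A, lr_ratio):
    """One SGD step where B uses a learning rate lr_ratio times larger than A's."""
    lr_B = lr_ratio * lr_A
    return A - lr_A * grad_A, B - lr_B * grad_B

A = np.ones((2, 3))                     # toy adapter matrices (assumed shapes)
B = np.zeros((4, 2))
grad_A = np.ones_like(A)                # placeholder gradients for the demo
grad_B = np.ones_like(B)

A_new, B_new = lora_plus_sgd_step(A, B, grad_A, grad_B, lr_A=0.01, lr_ratio=16.0)
```

With identical gradients, B moves 16 times further than A in this toy step, which is the behavior LoRA+ asks for.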
QLoRA: Quantized LoRA
Although LoRA reduces trainable parameters, the frozen base model still requires significant memory. QLoRA compresses the frozen base model via quantization and trains LoRA adapters on top of it, enabling fine-tuning of a 70B parameter model on a single 48GB GPU (with rank 64, the trainable parameters occupy only 1.17GB).
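A back-of-the-envelope estimate shows why 4-bit storage matters. This sketch counts only the base weights; quantization constants, activations, adapter weights, and optimizer state are deliberately left out, and the function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(70e9, 16)    # 16-bit base model
nf4_gb = weight_memory_gb(70e9, 4)      # 4-bit quantized base model
```

The 16-bit weights come to about 140 GB, while the 4-bit weights come to about 35 GB, which leaves headroom on a 48GB GPU.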
NF4 Quantization
QLoRA stores the base weights in 4 bits, but 16 equally spaced levels would waste precision, because trained weights are concentrated near zero. NormalFloat 4 (NF4) instead uses quantile quantization: the levels are placed at quantiles of a zero-mean normal distribution, so that each level represents a roughly equal share of the weights. Because 4 bits give 16 levels (an even number), a symmetric grid cannot include an exact zero, so the construction is asymmetric: 8 levels from -1 to 0 and 9 levels from 0 to 1, and removing the duplicate zero leaves 16 in total. NF4 thus spends its precision where the weights are dense, which reduces quantization error.
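The asymmetric construction can be sketched with the standard library's normal quantile function. The tail probability `offset` is an assumed value chosen for illustration (the description above does not give one), and `nf4_like_levels` is a hypothetical name; treat the result as an approximation of the idea rather than a reference table.

```python
from statistics import NormalDist

def linspace(start, stop, n):
    """n evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

def nf4_like_levels(offset=0.9677083):
    """16 levels at normal quantiles: 7 negative, one exact zero, 8 positive."""
    ppf = NormalDist().inv_cdf
    # The positive side gets one more level than the negative side.
    # The last probability (0.5, i.e. zero) is dropped and zero is added once.
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]          # rescale into [-1, 1]

levels = nf4_like_levels()
gap_near_zero = levels[8] - levels[7]             # zero to first positive level
gap_at_edge = levels[15] - levels[14]             # last two positive levels
```

The gap between neighboring levels is much smaller near zero than at the edges, which is where most weights sit.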
Block-wise Quantization and Double Quantization
If a single scaling factor is used for the entire model, a single outlier would dominate the scaling. Therefore, weights are divided into small blocks (e.g., 64 elements per block), and each block computes its own scaling factor. However, the scaling factors themselves also need storage (32 bits), adding an extra 0.5 bits per parameter. QLoRA further quantizes the scaling factors to 8 bits (double quantization), saving additional memory.
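The outlier problem and the storage overhead can both be sketched in a few lines. For simplicity this uses 16 uniform levels rather than NF4, a synthetic weight vector with one planted outlier, and absmax scaling. The second-level block size of 256 used in the double-quantization arithmetic is an assumption not stated above.

```python
import numpy as np

def quantize_blockwise(w, levels, block_size):
    """Absmax-scale each block to [-1, 1], then snap to the nearest level."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one scale per block
    scales = np.where(scales == 0.0, 1.0, scales)        # avoid dividing by zero
    idx = np.abs((blocks / scales)[..., None] - levels).argmin(axis=-1)
    return idx, scales

def dequantize_blockwise(idx, scales, levels):
    """Look up each level and undo the per-block scaling."""
    return (levels[idx] * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)    # synthetic weights near zero
w[0] = 5.0                              # a single planted outlier
levels = np.linspace(-1.0, 1.0, 16)     # uniform 4-bit grid (simplification)

def mean_abs_error(block_size):
    idx, scales = quantize_blockwise(w, levels, block_size)
    return np.abs(dequantize_blockwise(idx, scales, levels) - w).mean()

err_global = mean_abs_error(w.size)     # one scale for everything
err_block = mean_abs_error(64)          # one scale per 64 weights

# Storage overhead of the scales, in bits per weight
overhead_fp32 = 32 / 64                          # one 32-bit scale per 64 weights
overhead_double = 8 / 64 + 32 / (64 * 256)       # 8-bit scales plus a 32-bit scale per 256 of them
```

With one global scale the outlier stretches the grid so far that ordinary weights land on coarse levels; with per-block scales only the outlier's own block is affected. Under the assumed second-level block size, the overhead falls from 0.5 to roughly 0.127 bits per weight.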
VeRA: Shared Random Directions
In LoRA, each linear layer learns two small matrices, and the accumulated parameters across layers are still significant. VeRA uses a pair of randomly initialized matrices that are frozen and shared across all adapted layers. Each layer only learns two trainable diagonal vectors: ( \mathbf{b} ) (starting from zero) and ( \mathbf{d} ) (starting from a small constant). The weight update is: [ \Delta W = \text{diag}(\mathbf{b}) \cdot B_{\text{fixed}} \cdot \text{diag}(\mathbf{d}) \cdot A_{\text{fixed}} ] The number of trainable parameters is minimal. For example, with input dimension 4096, rank 64, and 80 layers, standard LoRA requires about 42 million parameters, while VeRA requires only about 333,000 (a 126x reduction). VeRA does not learn directions; it only learns the intensity of using the frozen random directions.
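The update rule and the parameter counts can be reproduced as follows. This sketch assumes square layers (input and output dimension both 4096) with one adapted matrix per layer, which is the reading that matches the figures quoted above. The initial value 0.1 for d and the function name are assumptions.

```python
import numpy as np

def vera_delta_w(b, B_fixed, d, A_fixed):
    """diag(b) @ B_fixed @ diag(d) @ A_fixed, without building the diagonal matrices."""
    return (b[:, None] * B_fixed * d[None, :]) @ A_fixed

rng = np.random.default_rng(3)
d_out, d_in, r = 6, 5, 3                # toy sizes (assumed)
B_fixed = rng.normal(size=(d_out, r))   # frozen, shared across layers
A_fixed = rng.normal(size=(r, d_in))    # frozen, shared across layers
b = np.zeros(d_out)                     # trainable, starts at zero
d = np.full(r, 0.1)                     # trainable, small constant (assumed value)

delta_at_init = vera_delta_w(b, B_fixed, d, A_fixed)

# Parameter counts for the example in the text
dim, rank, layers = 4096, 64, 80
lora_params = layers * rank * (dim + dim)   # A and B for every layer
vera_params = layers * (dim + rank)         # b and d for every layer
reduction = lora_params / vera_params
```

Because b starts at zero, the update is zero at initialization, mirroring LoRA's zero-initialized B.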
DoRA: Weight-Decomposed Low-Rank Adaptation
LoRA’s correction to the weight matrix changes both direction and magnitude, coupling them together. DoRA decomposes each output neuron’s weight into a magnitude (the norm of the corresponding row vector) and a unit direction, so the pre-trained weights are written as ( W_0 = \text{diag}(\mathbf{m}) \cdot \hat{W} ), where ( \mathbf{m} ) holds the row norms and ( \hat{W} ) has unit-norm rows. It then applies a LoRA-style update to the directional part and learns the magnitudes ( \mathbf{m} ) as separate parameters. This decoupling brings the adaptation behavior closer to full fine-tuning while adding very few extra parameters.
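A minimal sketch of the magnitude and direction split, using the row-wise convention above. It assumes the direction is re-normalized after the low-rank update is added, which is one common way to write DoRA; the function name and toy sizes are illustrative.

```python
import numpy as np

def dora_weight(W0, A, B, m):
    """Rebuild the weight as diag(m) times the row-normalized (W0 + B A)."""
    V = W0 + B @ A                                    # LoRA-style directional update
    row_norms = np.linalg.norm(V, axis=1, keepdims=True)
    return m[:, None] * (V / row_norms)               # unit rows rescaled by m

rng = np.random.default_rng(4)
d_out, d_in, r = 6, 5, 2                # toy sizes (assumed)
W0 = rng.normal(size=(d_out, d_in))     # frozen pre-trained weights

# Initialization: m is the row norm of W0 and B is zero, so nothing changes yet
m_init = np.linalg.norm(W0, axis=1)
A = rng.normal(size=(r, d_in))
B_init = np.zeros((d_out, r))
W_at_init = dora_weight(W0, A, B_init, m_init)

# After training, direction (via A, B) and magnitude (via m) move independently
B_trained = rng.normal(size=(d_out, r))
m_trained = np.abs(rng.normal(size=d_out)) + 0.5
W_trained = dora_weight(W0, A, B_trained, m_trained)
```

At initialization the reconstructed weight equals the pre-trained one, and afterwards each row's norm is set purely by m regardless of what the low-rank update does to its direction.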
Source: YouTube video link (https://www.youtube.com/watch?v=U80tjcThl9Q)