@jbhuang0604: LoRA, low-rank adaptation, is arguably the most popular parameter-efficient fine-tuning method for LLMs. But how does i…

X AI KOLs Timeline 06/25/26, 03:00 PM Models

lora low-rank-adaptation parameter-efficient-fine-tuning qlora vera dora fine-tuning llm

Summary

LoRA (low-rank adaptation) is the most popular parameter-efficient fine-tuning method for LLMs. This video introduces how LoRA and its variants (LoRA+, QLoRA, VeRA, DoRA) work.

LoRA, low-rank adaptation, is arguably the most popular parameter-efficient fine-tuning method for LLMs. But how does it actually work? Check out the video to learn LoRA and friends (LoRA+, QLoRA, VeRA, and DoRA)! https://t.co/Bt9ky4oKnk https://t.co/9nUhaG5a8q

TL;DR

LoRA dramatically reduces trainable parameters via low-rank decomposition, making large model fine-tuning efficient. Subsequent methods like QLoRA, VeRA, and DoRA further reduce memory requirements or improve adaptation flexibility.

The Necessity of Parameter-Efficient Fine-Tuning

Modern AI models are extremely large. Full fine-tuning (updating all weights) is costly: a 70 billion parameter model at 16-bit precision requires about 140 GB of memory just for the weights. If we save a full copy for each task, storage costs quickly become unmanageable. Therefore, we need parameter-efficient adaptation methods.

Core Idea of LoRA

In Transformers, most computation occurs in linear layers (Q/K/V projections in attention, expansion and compression in feedforward networks). These layers perform the same operation: ( Y = W \cdot X ).

Instead of fine-tuning the entire weight matrix ( W ) directly, we freeze the original pre-trained weights ( W_0 ) and add a trainable correction matrix ( \Delta W ). After fine-tuning, ( W_0 + \Delta W ) can be merged into a single matrix, making inference speed identical to the original model.

Low-Rank Decomposition

Key assumption: the correction matrix ( \Delta W ) has low-rank structure. For example, the columns of a 6×6 matrix may be redundant, spanning only two independent directions (rank 2). We can therefore factor ( \Delta W ) into two smaller matrices, ( B ) (tall and skinny, ( d_{\text{out}} \times r )) and ( A ) (short and wide, ( r \times d_{\text{in}} )), so that ( \Delta W = B A ). This is low-rank adaptation (LoRA). Choosing a rank ( r ) much smaller than the input and output dimensions cuts the number of trainable parameters from ( d_{\text{out}} d_{\text{in}} ) to ( r (d_{\text{in}} + d_{\text{out}}) ).

In practice, LoRA controls the update strength with a scaling factor ( \alpha / r ), keeping the update size stable across different ranks.
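The decomposition, the ( \alpha / r ) scaling, and the merge step can be sketched in a few lines of NumPy. This is a minimal illustration rather than any library's implementation: the sizes, the value of `alpha`, and the names `lora_forward` and `W_merged` are assumptions made for the example, and `B` is given small random values here (instead of zeros) only so that the equivalence check is non-trivial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and scaling, not taken from the video.
d_in, d_out, r, alpha = 64, 48, 4, 8

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # short and wide, trainable
B = rng.normal(scale=0.01, size=(d_out, r))  # tall and skinny, trainable
X = rng.normal(size=(d_in, 5))               # five input columns

scale = alpha / r


def lora_forward(X):
    """Frozen path plus scaled low-rank correction: W0 X + (alpha/r) B (A X)."""
    return W0 @ X + scale * (B @ (A @ X))


# Merging folds the correction into a single matrix,
# so inference needs one matmul, exactly like the original model.
W_merged = W0 + scale * (B @ A)

full_params = d_out * d_in        # parameters in a full update of W
lora_params = r * (d_in + d_out)  # parameters in A and B together
```

With these toy sizes the adapter trains 448 values instead of 3,072, and `W_merged @ X` matches `lora_forward(X)` up to floating-point error.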

Key-Value Interpretation

Decompose ( B ) into column vectors ( \mathbf{b}_i ) and ( A ) into row vectors ( \mathbf{a}_i ), then: [ \Delta Y = B A X = \sum_{i=1}^{r} \mathbf{b}_i (\mathbf{a}_i \cdot X) ] Here, the ( \mathbf{a}_i ) act like feature detectors (keys), and the ( \mathbf{b}_i ) specify the corresponding corrections (values). For example, when fine-tuning on medical literature, one key might detect heart disease, and the corresponding value adds cardiology expertise. LoRA acts as a local associative memory: only task-relevant inputs activate corrections, while the base model retains general knowledge.
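The key-value reading can be checked numerically. The sketch below, with arbitrary toy sizes, computes the correction once as a matrix product and once as a sum of values weighted by key scores, then builds an input that no key responds to (`x_perp`, a hypothetical name) to show that such an input receives no correction.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 8, 6, 2            # toy sizes chosen for illustration

A = rng.normal(size=(r, d_in))      # rows a_i are the keys
B = rng.normal(size=(d_out, r))     # columns b_i are the values
x = rng.normal(size=d_in)           # a single input vector

# Direct form: Delta y = B (A x)
delta_direct = B @ (A @ x)

# Key-value form: each key scores the input, each value is added in proportion.
scores = A @ x                      # scores[i] = a_i . x
delta_sum = sum(scores[i] * B[:, i] for i in range(r))

# Remove from x every component that the keys can detect.
x_perp = x - A.T @ np.linalg.solve(A @ A.T, A @ x)
delta_perp = B @ (A @ x_perp)       # no key fires, so the correction vanishes
```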

Initialization Strategy

At the start of fine-tuning we want ( \Delta Y = 0 ), so that random perturbations do not disrupt pre-trained knowledge. One matrix therefore starts at zero and the other at random values. A common choice is to initialize ( A ) randomly and ( B ) to zero. With this choice, ( B ) receives non-zero gradients from the first step; the gradient of ( A ) is proportional to ( B ), so ( A ) starts learning as soon as ( B ) has moved away from zero. Random ( A ) provides diverse initial feature directions, which helps early learning.
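A small hand-written gradient computation makes the effect of this initialization concrete. The squared-error loss, the target `T`, the step size, and the helper name `grads` are all assumptions made for the sketch; only the choice of random ( A ) and zero ( B ) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, r, n = 8, 6, 2, 10     # toy sizes chosen for illustration

W0 = rng.normal(size=(d_out, d_in)) # frozen pre-trained weight
A = rng.normal(size=(r, d_in))      # random init
B = np.zeros((d_out, r))            # zero init
X = rng.normal(size=(d_in, n))
T = rng.normal(size=(d_out, n))     # hypothetical regression targets


def grads(A, B):
    """Gradients of 0.5 * ||Y - T||^2 with Y = W0 X + B A X."""
    Y = W0 @ X + B @ (A @ X)
    G = Y - T                       # dLoss/dY
    grad_B = G @ (A @ X).T          # depends on A, non-zero at init
    grad_A = B.T @ G @ X.T          # proportional to B, zero at init
    return Y, grad_A, grad_B


Y0, grad_A0, grad_B0 = grads(A, B)

# One plain gradient step on B is enough to unlock gradients for A.
B1 = B - 0.01 * grad_B0
_, grad_A1, _ = grads(A, B1)
```

At step zero the output equals the base model's output and only `grad_B0` is non-zero; after a single update of ( B ), `grad_A1` becomes non-zero as well.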

LoRA+: Different Learning Rates

Matrix ( A ) maps the wide input into the low-rank space, and matrix ( B ) maps the low-rank features back to the output space. Because the two matrices play different roles and operate at different scales, a single shared learning rate leaves ( B ) under-trained. LoRA+ assigns a larger learning rate to ( B ) and a smaller one to ( A ), accelerating convergence and improving final performance.
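In code, the idea amounts to giving the two matrices separate step sizes. The sketch below uses plain gradient descent for brevity; the function name `loraplus_sgd_step`, the base learning rate, and the ratio of 16 are hypothetical choices for illustration, not values stated in the video.

```python
import numpy as np


def loraplus_sgd_step(A, B, grad_A, grad_B, lr_A=1e-4, ratio=16.0):
    """One gradient-descent step where B uses a learning rate `ratio` times larger than A's.

    `ratio` is a tunable hyperparameter; 16.0 is only a placeholder default.
    """
    lr_B = ratio * lr_A
    new_A = A - lr_A * grad_A
    new_B = B - lr_B * grad_B
    return new_A, new_B
```

With an optimizer library, the same effect is usually obtained by placing ( A ) and ( B ) in separate parameter groups, each with its own learning rate.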

QLoRA: Quantized LoRA

Although LoRA reduces the number of trainable parameters, the frozen base model still takes significant memory. QLoRA compresses the base model via quantization, which makes it possible to fine-tune a 70B-parameter model with LoRA adapters on a single 48 GB GPU: at 4 bits per weight the base weights take roughly 35 GB, and with rank 64 the trainable parameters occupy only 1.17 GB.

NF4 Quantization

QLoRA stores the base weights in 4 bits, but 16 equally spaced levels would waste precision, because weights are mostly concentrated near zero. NormalFloat 4 (NF4) uses quantile quantization instead, placing levels according to a zero-mean normal distribution so that each bin holds roughly the same number of weights. Since 4 bits give 16 levels (an even number), including an exact zero requires an asymmetric construction: 8 levels from -1 to 0 and 9 levels from 0 to 1, with the duplicate zero removed to yield 16 in total. NF4 thus allocates precision to weight-dense regions, reducing quantization error.
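The asymmetric construction can be sketched with the standard library's normal quantile function. The helper names `nf4_levels` and `linspace` are made up for this example, and the tail probability `offset` is an assumed value that keeps the outermost quantile finite; treat the result as an illustration of the recipe rather than a reference table.

```python
from statistics import NormalDist


def nf4_levels(offset=0.9677083):
    """Build 16 normal-quantile levels in [-1, 1] that include an exact zero.

    `offset` is an assumed tail probability for the outermost levels.
    """
    inv_cdf = NormalDist().inv_cdf  # quantile function of N(0, 1)

    def linspace(start, stop, num):
        step = (stop - start) / (num - 1)
        return [start + i * step for i in range(num)]

    # 9 probabilities from `offset` down to 0.5; the last one maps to zero,
    # so dropping it leaves 8 strictly positive levels.
    positive = [inv_cdf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    # 8 probabilities on the other side; dropping the zero leaves 7 negative levels.
    negative = [-inv_cdf(p) for p in linspace(offset, 0.5, 8)[:-1]]

    # A single shared zero joins the two halves: 7 + 1 + 8 = 16 levels.
    levels = sorted(negative + [0.0] + positive)
    top = max(levels)
    return [v / top for v in levels]  # rescale so the levels span [-1, 1]


levels = nf4_levels()
```

The resulting levels are packed tightly around zero and spread out toward ±1, which is where a normal distribution places fewer weights.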

Block-wise Quantization and Double Quantization

If a single scaling factor were used for the entire model, one outlier would dominate the scale and squash every other weight toward zero. Weights are therefore split into small blocks (e.g., 64 elements per block), and each block computes its own scaling factor. The scaling factors themselves need storage, though: one 32-bit factor per 64-weight block adds an extra 0.5 bits per parameter. QLoRA quantizes the scaling factors in turn to 8 bits (double quantization), saving additional memory.
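The sketch below illustrates why per-block scales help and what they cost. It uses 16 evenly spaced levels as a stand-in for NF4 to stay short, and the function names, the synthetic weights, and the outlier value are invented for the example. The second-level group size of 256 in the overhead arithmetic is an assumption about how the 8-bit scales are themselves grouped under one 32-bit constant.

```python
import numpy as np


def blockwise_quantize(weights, levels, block_size=64):
    """Absmax-quantize a flat array block by block (length must divide evenly)."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)   # one scale per block
    scales[scales == 0] = 1.0                       # avoid dividing by zero
    # Index of the nearest level for every normalized weight.
    codes = np.abs((w / scales)[..., None] - levels).argmin(axis=-1)
    return codes.astype(np.uint8), scales


def dequantize(codes, scales, levels):
    return (levels[codes] * scales).ravel()


levels = np.linspace(-1.0, 1.0, 16)     # stand-in for the NF4 levels
rng = np.random.default_rng(5)
weights = rng.normal(scale=0.02, size=256)
weights[0] = 5.0                        # a single outlier in the first block


def mean_error(block_size):
    """Mean reconstruction error over the weights outside the outlier's block."""
    codes, scales = blockwise_quantize(weights, levels, block_size)
    restored = dequantize(codes, scales, levels)
    return np.abs(restored - weights)[64:].mean()


err_one_scale = mean_error(256)   # one scale for everything
err_blockwise = mean_error(64)    # one scale per 64 weights

# Storage overhead of the scales, in bits per weight.
bits_plain = 32 / 64                      # 32-bit scale per 64-weight block
bits_double = 8 / 64 + 32 / (64 * 256)    # 8-bit scales plus an assumed 32-bit constant per 256 scales
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per weight, and the blocks that do not contain the outlier are reconstructed far more accurately with per-block scales.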

VeRA: Shared Random Directions

In LoRA, each adapted linear layer learns two small matrices, and the parameters accumulated across layers are still significant. VeRA instead uses a single pair of randomly initialized matrices that are frozen and shared across all adapted layers. Each layer learns only two scaling vectors: ( \mathbf{b} ) (starting from zero) and ( \mathbf{d} ) (starting from a small constant). The weight update is: [ \Delta W = \text{diag}(\mathbf{b}) \cdot B_{\text{fixed}} \cdot \text{diag}(\mathbf{d}) \cdot A_{\text{fixed}} ] The number of trainable parameters is minimal. For example, with input and output dimension 4096, rank 64, and 80 layers (one adapted matrix per layer), standard LoRA requires about 42 million parameters, while VeRA requires only about 333,000 (a 126x reduction). VeRA does not learn directions; it only learns how strongly to use each frozen random direction.
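The parameter counts above can be reproduced with a few lines of arithmetic, assuming one square 4096×4096 adapted matrix per layer. The second half of the sketch builds the VeRA update at toy sizes; the name `vera_delta` and the initial value 0.1 for ( \mathbf{d} ) are assumptions (the text only says "a small constant").

```python
import numpy as np

# Parameter counts, assuming one square adapted matrix per layer.
d, r, n_layers = 4096, 64, 80
lora_params = n_layers * r * (d + d)   # A is r x d, B is d x r
vera_params = n_layers * (d + r)       # vector b has d entries, vector d has r entries

# The update itself, at toy sizes so it runs instantly.
rng = np.random.default_rng(3)
d_small, r_small = 16, 4
A_fixed = rng.normal(size=(r_small, d_small))   # frozen, shared across layers
B_fixed = rng.normal(size=(d_small, r_small))   # frozen, shared across layers
b = np.zeros(d_small)                           # trainable, starts at zero
d_vec = np.full(r_small, 0.1)                   # trainable, assumed initial constant


def vera_delta(b, d_vec):
    """diag(b) @ B_fixed @ diag(d) @ A_fixed, using broadcasting instead of diag()."""
    return (b[:, None] * B_fixed) @ (d_vec[:, None] * A_fixed)


delta_at_init = vera_delta(b, d_vec)   # all zeros, because b starts at zero
```

Multiplying a matrix on the left by a diagonal matrix scales its rows, which is why broadcasting `b[:, None]` and `d_vec[:, None]` is equivalent to the explicit `diag` products in the formula.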

DoRA: Weight-Decomposed Low-Rank Adaptation

LoRA’s correction to the weight matrix changes both direction and magnitude, coupling them together. DoRA decomposes each output neuron’s weight into a magnitude (the norm of the row vector) and a unit direction, writing the pre-trained weights as ( W_0 = \text{diag}(\mathbf{m}) \cdot \hat{W} ), where ( \mathbf{m} ) holds the row norms and ( \hat{W} ) has unit-norm rows. It then applies a LoRA-style update to the directional part and learns the magnitudes ( \mathbf{m} ) as separate parameters. This decoupling brings adaptation closer to full fine-tuning while adding very few extra parameters.
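A minimal NumPy sketch of this decomposition follows, using the row-norm convention described above. The toy sizes and the function name `dora_weight` are assumptions, and renormalizing the updated direction on every call is one straightforward reading of the method rather than a statement about any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_out, r = 8, 6, 2                # toy sizes chosen for illustration

W0 = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
m = np.linalg.norm(W0, axis=1)          # one trainable magnitude per output neuron
A = rng.normal(size=(r, d_in))          # LoRA factors acting on the direction
B = np.zeros((d_out, r))                # zero init, so training starts from W0


def dora_weight(m, A, B):
    """diag(m) times the row-normalized direction of (W0 + B A)."""
    V = W0 + B @ A                                         # updated direction, not yet unit norm
    V_hat = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm rows
    return m[:, None] * V_hat


W_init = dora_weight(m, A, B)           # reproduces W0 at initialization

# After a low-rank update, directions change but each row norm is still set by m.
B_moved = rng.normal(scale=0.1, size=(d_out, r))
W_adapted = dora_weight(m, A, B_moved)
```

Because each row is renormalized before being scaled, the low-rank factors can only change directions, and all changes in magnitude flow through ( \mathbf{m} ).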

Source: YouTube video link (https://www.youtube.com/watch?v=U80tjcThl9Q)

Similar Articles

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

Parameter-Efficient Fine-Tuning with Learnable Rank

BaLoRA: Bayesian Low-Rank Adaptation of Large Scale Models

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

@0xSero: Highly recommended educational content. LoRA is one of the coolest things to dabble in, lets anyone fine tune models re…
