BaLoRA: Bayesian Low-Rank Adaptation of Large Scale Models
Summary
BaLoRA introduces a Bayesian extension to Low-Rank Adaptation (LoRA) that provides calibrated uncertainty estimates and improves prediction accuracy by narrowing the gap with full fine-tuning.
Cached at: 05/12/26, 06:41 AM
# Bayesian Low-Rank Adaptation of Large Scale Models
Source: [https://arxiv.org/html/2605.08110](https://arxiv.org/html/2605.08110)
Dario Coscia \(mathLab, SISSA; University of Amsterdam\), Sindy Löwe \(CuspAI\), Max Welling \(CuspAI; University of Amsterdam\)\. Work done in collaboration with the University of Amsterdam\. Correspondence to dario\.coscia@sissa\.it\.
###### Abstract
Low\-Rank Adaptation \(LoRA\) has become the standard for fine\-tuning large pre\-trained models at reduced computational cost\. However, its low\-rank point\-estimate updates limit expressiveness, leave a persistent gap relative to full fine\-tuning accuracy, and provide no built\-in uncertainty quantification, limiting its applicability in settings where reliability matters as much as accuracy\. We introduce BaLoRA, a Bayesian extension of LoRA with a novel input\-adaptive Bayesian parameterization of LoRA matrices that adds minimal parameters and compute\. Surprisingly, not only does the Bayesian extension yield well\-calibrated uncertainty estimates, but the adaptive noise injection underlying our approach also significantly improves prediction accuracy, narrowing the gap with full fine\-tuning across both natural language reasoning and vision tasks\. When applied to band gap prediction in metal\-organic frameworks, BaLoRA produces zero\-shot test\-time uncertainty estimates that correlate more strongly with model error than a trained ensemble of LoRA models, and improve monotonically with compute without sacrificing accuracy\.
## 1 Introduction
Large pre\-trained models have shown remarkable generalization across a wide range of tasks, from natural language processing \(Qin et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib18); Touvron et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib20)\) and multi\-modal applications \(Liu et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib19)\) to atomistic simulations \(Wood et al\., [2025](https://arxiv.org/html/2605.08110#bib.bib15); Shoghi et al\., [2024](https://arxiv.org/html/2605.08110#bib.bib16); Kang et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib17)\)\. Adapting these models to downstream tasks typically relies on full fine\-tuning \(FT\), which updates all parameters using task\-specific data\. However, as model sizes grow, FT becomes impractical due to high computational and memory costs\. Even when feasible, deploying multiple fine\-tuned instances is challenging, as each retains the full model footprint\.
Figure 1: Overview of BaLoRA\. \(Left\) BaLoRA extends LoRA by treating the reduction matrix $\bm{\theta}_A$ as a random variable with input\-adaptive Gaussian noise, $\bm{\omega}_A \sim \mathcal{N}(\bm{\theta}_A,\, \alpha(\bm{x})\bm{\theta}_A^{2})$, while $\bm{\theta}_0$ stays frozen\. At inference, BaLoRA runs in *deterministic* mode \(zero latency, merged adapter\) or *stochastic* mode \(calibrated uncertainty via sampling\)\. \(Right\) BaLoRA places an input\-adaptive uncertainty ellipsoid around the LoRA point estimate, confined to the low\-rank space\. Uncertainty scales with input activations, concentrating the gradient signal where adaptation is active and acting as an implicit regulariser on the pretrained representations\.
Parameter\-efficient fine\-tuning \(PEFT\) \(Houlsby et al\., [2019](https://arxiv.org/html/2605.08110#bib.bib21)\) addresses this by updating only a small subset of parameters while freezing the rest, improving efficiency without sacrificing much performance\. Low\-Rank Adaptation \(LoRA\) \(Hu et al\., [2022](https://arxiv.org/html/2605.08110#bib.bib1)\) is a widely used PEFT method that constrains updates to a low\-rank structure, capturing task\-specific changes with far fewer parameters\. However, LoRA and related approaches are limited by their low\-rank, point\-estimate updates, which reduce expressiveness and leave a gap compared to FT \(Liu et al\., [2024a](https://arxiv.org/html/2605.08110#bib.bib22)\)\. Additionally, they lack uncertainty quantification, which is crucial in safety\-critical and data\-scarce settings\. This is especially important in scientific domains such as materials science and weather forecasting, where reliable uncertainty estimates are essential \(Wood et al\., [2025](https://arxiv.org/html/2605.08110#bib.bib15); Bodnar et al\., [2025](https://arxiv.org/html/2605.08110#bib.bib23); Coscia et al\., [2025a](https://arxiv.org/html/2605.08110#bib.bib2), [b](https://arxiv.org/html/2605.08110#bib.bib24)\)\.
Motivated by the limitations of existing PEFT methods and the growing need for reliable uncertainty estimates in scientific applications, we present BaLoRA, a Bayesian extension of LoRA that addresses both challenges simultaneously\. BaLoRA introduces a novel input\-adaptive Bayesian reparametrization of the LoRA reduction matrix, treating its entries as random variables rather than fixed point estimates\. This allows uncertainty to be naturally encoded during training via adaptive noise injection, which encourages the reduction matrix to focus on inputs associated with high uncertainty, leading to improved task performance\. At inference, BaLoRA can be deployed in deterministic mode by merging adapter weights as in standard LoRA, incurring no additional latency, or in Bayesian mode when uncertainty estimates are required\. Our main contributions are as follows:
- We propose a stochastic PEFT method with an input\-adaptive Bayesian reparametrization of the LoRA reduction matrix, where input\-dependent noise injection acts as an implicit regulariser that eases training, resulting in performance improvements, while enabling principled uncertainty quantification\.
- We derive a low\-rank local reparametrization trick that exploits the LoRA factorization to sample the posterior predictive exactly in the low\-dimensional latent space, avoiding materialisation of the full output covariance and matching the computational scaling of standard LoRA\.
- We achieve state\-of\-the\-art commonsense reasoning on 6/8 benchmarks \(Llama\-3\-8B\) and image classification \(ViT\-L/16\), and demonstrate the practical value of BaLoRA’s uncertainty estimates in a real\-world scientific setting \(MOF property prediction\), where BaLoRA outperforms a trained LoRA ensemble with only a single fine\-tuning run\.
## 2 Background and Related Works
In this section, we present the relevant background on Parameter\-Efficient Fine\-Tuning, Low\-Rank Adapters, Bayesian modelling, and Uncertainty Quantification, which will later support the formulation of our proposed method\.
### 2\.1 Parameter\-Efficient Fine\-Tuning \(PEFT\)
Parameter\-efficient fine\-tuning \(PEFT\) \(Houlsby et al\., [2019](https://arxiv.org/html/2605.08110#bib.bib21)\) adapts large pre\-trained models to downstream tasks at low computational cost by updating only a small fraction of parameters\. Two main categories can be distinguished\. *Prompt\-based* methods \(Lester et al\., [2021](https://arxiv.org/html/2605.08110#bib.bib25); Razdaibiedina et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib26); Liu et al\., [2024b](https://arxiv.org/html/2605.08110#bib.bib27)\) prepend learnable soft tokens to the model input, optimizing only these additional vectors while keeping all weights frozen\. *Adapter\-based* methods \(Houlsby et al\., [2019](https://arxiv.org/html/2605.08110#bib.bib21); Mahabadi et al\., [2021](https://arxiv.org/html/2605.08110#bib.bib28)\) instead inject small trainable modules into the frozen model, offering strong performance and training stability\. However, most adapter\-based approaches incur inference overhead, limiting their practical applicability\.
Low\-Rank Adapters such as LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.08110#bib.bib1)\) introduce no additional inference latency and build on the observation that fine\-tuning updates exhibit low intrinsic rank \(Aghajanyan et al\., [2021](https://arxiv.org/html/2605.08110#bib.bib29)\)\. Since LoRA can underperform full fine\-tuning across diverse tasks, several extensions have been proposed\. DoRA \(Liu et al\., [2024a](https://arxiv.org/html/2605.08110#bib.bib22)\) separates weight updates into magnitude and directional components, applying a LoRA\-style update only to the directional part\. MoRA \(Jiang et al\., [2024](https://arxiv.org/html/2605.08110#bib.bib38)\) projects inputs into a compressed space, applies transformations with a higher\-rank matrix, and reconstructs them back to the original space\. HiRA \(Huang et al\., [2025](https://arxiv.org/html/2605.08110#bib.bib39)\) leverages a Hadamard product to preserve high\-rank update information, enhancing the expressive capacity of the model\.
### 2\.2 Bayesian Models and Uncertainty Quantification
Bayesian methods account for uncertainty in deep learning by framing learning as probabilistic inference \(Hinton and Van Camp, [1993](https://arxiv.org/html/2605.08110#bib.bib8); Welling and Teh, [2011](https://arxiv.org/html/2605.08110#bib.bib13); Graves, [2011](https://arxiv.org/html/2605.08110#bib.bib9); Gal and Ghahramani, [2016](https://arxiv.org/html/2605.08110#bib.bib10)\), treating weights as random variables updated from prior to posterior via Bayes’ theorem \(Bayes, [1763](https://arxiv.org/html/2605.08110#bib.bib7)\)\. Since exact posterior computation is intractable at scale, practical approaches rely on approximations\. Deep Ensembles combine independently trained models for implicit posterior estimation \(Lakshminarayanan et al\., [2017](https://arxiv.org/html/2605.08110#bib.bib14)\), extending naturally to parameter\-efficient settings such as ensemble LoRA \(Wang et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib11)\)\. Alternatively, variational methods cast posterior estimation as an optimization over tractable distributions \(Blundell et al\., [2015](https://arxiv.org/html/2605.08110#bib.bib12); Kingma et al\., [2015](https://arxiv.org/html/2605.08110#bib.bib6)\); notably, Variational Adaptive Dropout \(Coscia et al\., [2025a](https://arxiv.org/html/2605.08110#bib.bib2)\) achieves ensemble\-comparable performance and uncertainty at the cost of a single network\.
## 3 Methods
In this section, we introduce Bayesian Low\-Rank Adaptation \(BaLoRA\), a novel parameter\-efficient fine\-tuning strategy that builds upon Variational Adaptive Dropout \(Coscia et al\., [2025a](https://arxiv.org/html/2605.08110#bib.bib2)\) and extends the LoRA method by naturally encoding uncertainty during both training and inference\. The key innovation of BaLoRA lies in the Bayesian reparametrization of the learning process, in which an adaptive variance is incorporated into the LoRA reduction matrix \(see Figure [1](https://arxiv.org/html/2605.08110#S1.F1)\)\. During training, the input\-adaptive noise acts as a stochastic regulariser: larger activations in $\bm{\omega}_A$ induce larger perturbations, leading the model to learn a more robust low\-rank representation\. Critically, the adaptivity ensures that perturbations are concentrated where activations are largest, preventing overfitting precisely in the most active directions while leaving stable ones intact\. At test time, BaLoRA can be deployed in deterministic inference mode with no additional latency over LoRA, or in Bayesian mode when uncertainty quantification is required\.
### 3\.1 Low Rank Adaptation
LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.08110#bib.bib1)\) is a fine\-tuning strategy that freezes the pre\-trained model weights and injects trainable rank decomposition matrices into each linear layer of a neural network\. For a pre\-trained weight matrix $\bm{\theta}_0 \in \mathbb{R}^{k \times d}$, the update is constrained to a low\-rank factorization $\bm{\theta}_B \bm{\theta}_A$, where $\bm{\theta}_A \in \mathbb{R}^{r \times d}$ is the *reduction* matrix, and $\bm{\theta}_B \in \mathbb{R}^{k \times r}$ is the *reconstruction* matrix, with rank $r \ll \min(d, k)$\. Given input $\bm{x} \in \mathbb{R}^{d}$, the output is:
$$\bm{y} = \bm{\theta}_0 \bm{x} + \bm{\theta}_B \bm{\theta}_A \bm{x}. \tag{1}$$
Usually, $\bm{\theta}_A$ is initialized from an isotropic normal with a small standard deviation, and $\bm{\theta}_B$ is initialized to zero, ensuring the adapter contributes no signal at the start of training\. Only $\bm{\theta}_A$ and $\bm{\theta}_B$ are optimized; at inference, the adapter is merged into the pre\-trained weights as $\bm{\theta} = \bm{\theta}_0 + \bm{\theta}_B \bm{\theta}_A$, incurring no additional inference cost\.
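The equivalence between the adapter forward pass of Eq\. \(1\) and the merged weights can be checked with a minimal sketch\. It uses plain Python lists rather than a tensor library; the helper names \(`matvec`, `matmul`, `lora_forward`, `merge_adapter`\) and the toy dimensions are illustrative assumptions, not part of any LoRA implementation\.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(theta_0, theta_a, theta_b, x):
    """Eq. (1): y = theta_0 x + theta_B (theta_A x), never forming theta_B theta_A."""
    base = matvec(theta_0, x)
    update = matvec(theta_b, matvec(theta_a, x))
    return [b + u for b, u in zip(base, update)]


def merge_adapter(theta_0, theta_a, theta_b):
    """Fold the adapter into the frozen weights: theta = theta_0 + theta_B theta_A."""
    delta = matmul(theta_b, theta_a)
    return [[w + dw for w, dw in zip(row, drow)] for row, drow in zip(theta_0, delta)]


rng = random.Random(0)
d, k, r = 6, 4, 2  # toy sizes chosen for illustration only
theta_0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
theta_a = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]
# theta_B is zero at initialization; non-zero values mimic a trained adapter.
theta_b = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(k)]
x = [rng.gauss(0, 1) for _ in range(d)]

y_adapter = lora_forward(theta_0, theta_a, theta_b, x)
y_merged = matvec(merge_adapter(theta_0, theta_a, theta_b), x)
```

Both paths give the same output up to floating\-point error, which is why merging adds no inference cost\.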
### 3\.2 Bayesian Adaptive LoRA
To incorporate uncertainty into LoRA, we treat the entries of the reduction matrix as random variables rather than fixed parameters\. We denote the random reduction matrix by $\bm{\omega}_A$\. To obtain it, we perturb each entry of the deterministic reduction matrix $\bm{\theta}_A$ with input\-dependent Gaussian noise, as done in Variational Adaptive Dropout \(Coscia et al\., [2025a](https://arxiv.org/html/2605.08110#bib.bib2)\):
$$\begin{aligned} \omega_{A;ij} &= \theta_{A;ij} + \sqrt{\alpha(\bm{x})\,\theta_{A;ij}^{2}}\;\epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1) \\ \iff q(\bm{\omega}_A \mid \bm{x}) &= \prod_{i=1}^{r} \prod_{j=1}^{d} \mathcal{N}\big(\theta_{A;ij},\, \alpha(\bm{x})\,\theta_{A;ij}^{2}\big) \end{aligned} \tag{2}$$
Here, $\alpha(\bm{x})$ is a lightweight inference network that modulates the overall scale of the variance as a function of the input $\bm{x}$\. Intuitively, weights with larger magnitude carry proportionally more uncertainty, with $\alpha$ controlling how much the input itself drives that uncertainty\.
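A minimal sketch of the weight\-space perturbation in Eq\. \(2\) is given below\. The function name `sample_reduction_matrix` is hypothetical, and the scalar `alpha` is a fixed stand\-in for the output of the inference network $\alpha(\bm{x})$, whose architecture is not specified in this section\.

```python
import math
import random


def sample_reduction_matrix(theta_a, alpha, rng):
    """Draw omega_A per Eq. (2): theta + sqrt(alpha * theta^2) * eps, eps ~ N(0, 1)."""
    return [
        [t + math.sqrt(alpha * t * t) * rng.gauss(0.0, 1.0) for t in row]
        for row in theta_a
    ]


rng = random.Random(0)
theta_a = [[0.5, -1.0, 0.0], [2.0, 0.1, -0.3]]  # toy r=2, d=3 reduction matrix
alpha = 0.25  # assumed constant here; in BaLoRA it depends on the input x

omega_a = sample_reduction_matrix(theta_a, alpha, rng)
no_noise = sample_reduction_matrix(theta_a, 0.0, rng)
```

Because the noise standard deviation is $\sqrt{\alpha}\,|\theta_{A;ij}|$, an entry that is exactly zero stays zero, and setting $\alpha = 0$ recovers the deterministic LoRA weights\.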
Under this posterior, since $\bm{\theta}_B$ is deterministic, the predictive distribution of the output given the input is exactly Gaussian \(see Appendix [A\.1](https://arxiv.org/html/2605.08110#A1.SS1) for derivations\):
$$\bm{y} \sim \mathcal{N}\Big(\bm{\theta}_0 \bm{x} + \bm{\theta}_B \bm{\theta}_A \bm{x},\; \alpha\, \bm{\theta}_B\, \mathrm{diag}(\bm{\theta}_A^{2} \bm{x}^{2})\, \bm{\theta}_B^{\top}\Big), \tag{3}$$
where squares are applied element\-wise\. Crucially, the mean coincides with the standard LoRA forward pass, so at inference the weights can still be merged as $\bm{\theta} = \bm{\theta}_0 + \bm{\theta}_B \bm{\theta}_A$, incurring no additional inference cost\. Additionally, due to the multiplicative noise choice, the only parameters added relative to LoRA come from the inference network, which is usually lightweight\.
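The Gaussian form of Eq\. \(3\) can be seen from a short calculation in the rank\-$r$ latent space\. This is a sketch under the factorised posterior of Eq\. \(2\); the symbol $\bm{z} = \bm{\omega}_A \bm{x}$ for the latent activation is introduced here for illustration, and the full derivation is the one in Appendix A\.1\.

$$
\begin{aligned}
z_i &= \sum_{j=1}^{d} \omega_{A;ij}\, x_j = \sum_{j=1}^{d} \theta_{A;ij}\, x_j + \sqrt{\alpha(\bm{x})} \sum_{j=1}^{d} |\theta_{A;ij}|\, x_j\, \epsilon_{ij}, \\
\mathbb{E}[z_i] &= (\bm{\theta}_A \bm{x})_i, \qquad \mathrm{Var}[z_i] = \alpha(\bm{x}) \sum_{j=1}^{d} \theta_{A;ij}^{2}\, x_j^{2}, \qquad \mathrm{Cov}[z_i, z_{i'}] = 0 \;\; (i \neq i'), \\
\bm{y} &= \bm{\theta}_0 \bm{x} + \bm{\theta}_B \bm{z} \;\;\Rightarrow\;\; \mathrm{Cov}[\bm{y}] = \bm{\theta}_B\, \mathrm{Cov}[\bm{z}]\, \bm{\theta}_B^{\top} = \alpha\, \bm{\theta}_B\, \mathrm{diag}(\bm{\theta}_A^{2} \bm{x}^{2})\, \bm{\theta}_B^{\top}.
\end{aligned}
$$

Each $z_i$ is a sum of independent Gaussians, so $\bm{z}$ is Gaussian with diagonal covariance, and $\bm{y}$ is an affine function of $\bm{z}$\.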
#### Training and KL divergence\.
Variational Adaptive Dropout trains the Bayesian model by maximising the Evidence Lower Bound, which trades off between data fit \(the expected log\-likelihood term\) and model complexity \(KL divergence between the posterior and the prior\)\. The ELBO is given by:
$$\mathcal{L} = \mathbb{E}_{\bm{\omega}_A \sim q(\bm{\omega}_A \mid \bm{x})}\left[\log p(\mathbf{y} \mid \bm{\omega}_A, \bm{x})\right] - D_{KL}\left[q(\bm{\omega}_A \mid \bm{x}) \,\|\, p(\bm{\omega}_A)\right]. \tag{4}$$
In practice, a single Monte Carlo sample from the variational posterior at each training iteration approximates Eq\. \([4](https://arxiv.org/html/2605.08110#S3.E4)\) well\. We use the same weakly informative prior as Coscia et al\. \([2025a](https://arxiv.org/html/2605.08110#bib.bib2)\):
$$p(\bm{\omega}_A) = \prod_{i=1}^{r} \prod_{j=1}^{d} \mathcal{N}\left(0,\, \frac{p}{1-p}\, \theta_{A;ij}^{2}\right), \tag{5}$$
which can be interpreted as applying dropout to the network with dropout probability $p$\. Under this choice, the KL divergence between posterior and prior admits a closed form:
$$D_{KL}\left[q(\bm{\omega}_A \mid \bm{x}) \,\|\, p(\bm{\omega}_A)\right] = \frac{1}{2}\left(\frac{(\alpha+1)(1-p)}{p} + \log\frac{p}{1-p} - \log(\alpha) - 1\right), \tag{6}$$
which depends only on the scalar $\alpha$ and dropout rate $p$, making it cheap to evaluate and differentiate through\. A complete derivation is available in Appendix [A\.2](https://arxiv.org/html/2605.08110#A1.SS2)\.
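As a sanity check on the closed form, the sketch below plugs the per\-entry posterior of Eq\. \(2\) and prior of Eq\. \(5\) into the standard KL divergence between two univariate Gaussians\. Direct substitution yields a first term of $(\alpha+1)(1-p)/p$, and the weight $\theta_{A;ij}$ cancels\. The function names are hypothetical, and the value is per entry: with a shared $\alpha$, summing over the $r \times d$ independent entries would multiply it by $rd$\.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def balora_kl_per_entry(alpha, p):
    """Per-entry KL from substituting Eqs. (2) and (5) into gaussian_kl.

    The weight theta cancels, leaving a function of alpha and p only.
    """
    return 0.5 * (
        (alpha + 1.0) * (1.0 - p) / p
        + math.log(p / (1.0 - p))
        - math.log(alpha)
        - 1.0
    )


alpha, p = 0.3, 0.1  # illustrative values, not taken from the paper
closed = balora_kl_per_entry(alpha, p)

# The generic formula gives the same number for any non-zero weight theta.
direct_a = gaussian_kl(-0.7, alpha * 0.7**2, 0.0, p / (1.0 - p) * 0.7**2)
direct_b = gaussian_kl(3.0, alpha * 3.0**2, 0.0, p / (1.0 - p) * 3.0**2)
```

The agreement between `closed`, `direct_a`, and `direct_b` illustrates why the regulariser can be evaluated without touching the weights themselves\.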
### 3\.3 Low\-Rank local reparametrization trick
The local reparametrization trick \(Kingma et al\., [2015](https://arxiv.org/html/2605.08110#bib.bib6)\) was introduced to sample the predictive distribution of Bayesian linear layers efficiently\. For a single linear layer with input $\bm{x}$ and weight matrix $\bm{\omega} \sim \mathcal{N}(\bm{\theta}, \alpha \bm{\theta}^{2})$, the output $\bm{y}$ also follows a normal distribution:
$$\bm{y} \sim \mathcal{N}(\bm{\gamma}, \mathrm{diag}(\bm{\delta})), \tag{7}$$
where each output element $y_j$ can be written as
$$y_j = \gamma_j + \sqrt{\delta_j}\, \epsilon_j, \quad \gamma_j = \sum_{i} x_i \theta_{ij}, \quad \delta_j = \alpha \sum_{i} x_i^{2} \theta_{ij}^{2}, \quad \epsilon_j \sim \mathcal{N}(0, 1). \tag{8}$$
Therefore, rather than sampling the weight matrix $\bm{\omega}$ directly, one samples the output $\bm{y}$ from a Gaussian whose mean and variance are computed analytically, which reduces gradient variance and brings the computational cost down to two matrix multiplications\.
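Eq\. \(8\) can be sketched for a single dense layer as follows\. The function name `local_reparam_layer`, the toy weights, and the constant `alpha` are assumptions made for illustration; `theta[i][j]` connects input $i$ to output $j$, matching the index order of Eq\. \(8\)\.

```python
import math
import random


def local_reparam_layer(theta, x, alpha, rng):
    """Sample y per Eq. (8) for weights omega ~ N(theta, alpha * theta^2)."""
    n_in, n_out = len(theta), len(theta[0])
    # Mean: one matrix-vector product with the weights.
    gamma = [sum(x[i] * theta[i][j] for i in range(n_in)) for j in range(n_out)]
    # Variance: one matrix-vector product with squared weights and squared inputs.
    delta = [
        alpha * sum((x[i] * theta[i][j]) ** 2 for i in range(n_in))
        for j in range(n_out)
    ]
    # One noise draw per output unit instead of one per weight.
    y = [g + math.sqrt(v) * rng.gauss(0.0, 1.0) for g, v in zip(gamma, delta)]
    return y, gamma, delta


rng = random.Random(0)
theta = [[0.2, -0.5], [1.0, 0.3], [-0.4, 0.8]]  # 3 inputs, 2 outputs
x = [1.0, 2.0, -1.0]
y, gamma, delta = local_reparam_layer(theta, x, alpha=0.1, rng=rng)
```

The noise is drawn per output unit, so the number of random draws scales with the output width rather than with the number of weights\.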
#### Low\-Rank Sampling\.
In BaLoRA, a naive application of the local reparametrization trick is not directly viable\. Because the stochastic reduction matrix $\bm{\omega}_A$ is followed by the deterministic projection $\bm{\theta}_B$, the posterior predictive covariance of the full LoRA update is $\Sigma = \alpha\, \bm{\theta}_B\, \mathrm{diag}(\bm{\theta}_A^{2} \bm{x}^{2})\, \bm{\theta}_B^{\top}$, which is of size $\mathcal{O}(Bk^{2})$ for a batch of size $B$ and output dimension $k$\. Explicitly forming and sampling from this matrix is computationally prohibitive, even for moderate $k$\.
We instead exploit the low\-rank structure of the covariance to derive an efficient sampling procedure\. Define the element\-wise variance vector $\mathbf{d}(\bm{x}) = \alpha\, \bm{\theta}_A^{2} \bm{x}^{2} \in \mathbb{R}^{r}$\. The covariance then factors as:
$$\Sigma = \bm{\theta}_B\, \mathrm{diag}(\mathbf{d}(\bm{x}))\, \bm{\theta}_B^{\top} = \mathbf{A}(\bm{x}) \mathbf{A}(\bm{x})^{\top}, \quad \mathbf{A}(\bm{x}) = \bm{\theta}_B\, \mathrm{diag}\big(\sqrt{\mathbf{d}(\bm{x})}\big), \tag{9}$$
where we used the identity $\mathrm{diag}(\mathbf{d}) = \mathrm{diag}(\sqrt{\mathbf{d}})\, \mathrm{diag}(\sqrt{\mathbf{d}})$\. Drawing $\bm{\epsilon}_d \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_r)$, an exact sample from the posterior predictive is then obtained by \(see Appendix [A\.1](https://arxiv.org/html/2605.08110#A1.SS1) for derivations\):
$$\bm{y} = \bm{\theta}_0 \bm{x} + \bm{\theta}_B \bm{\theta}_A \bm{x} + \bm{\theta}_B \big(\sqrt{\mathbf{d}(\bm{x})} \odot \bm{\epsilon}_d\big), \tag{10}$$
where $\odot$ denotes element\-wise multiplication\. Crucially, the noise is sampled in the low\-dimensional space $\mathbb{R}^{r}$ before being projected up by $\bm{\theta}_B$, so the full covariance matrix is never materialised\. The resulting complexity is $\mathcal{O}(Bkr)$ in time and $\mathcal{O}(Br + Bk)$ in memory, matching the scaling of standard LoRA inference and adding only a constant\-factor overhead over it\. This stands in contrast to $\mathcal{O}(Bk^{2})$ for naive sampling of the full covariance, making BaLoRA practical even at large output dimensions where $k \gg r$\.
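The low\-rank sampling step of Eq\. \(10\), and its use for Monte Carlo uncertainty estimates in stochastic mode, can be sketched as below\. This is a single\-input, pure\-Python illustration: the names `balora_sample` and `mc_predict`, the toy matrices, the constant `alpha`, and the choice of 2000 samples are all assumptions rather than details from the paper\.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def balora_sample(theta_0, theta_a, theta_b, x, alpha, eps):
    """One draw of Eq. (10). eps has length r; zeros give the deterministic mean."""
    # d(x) = alpha * theta_A^2 x^2 (element-wise squares): one variance per latent unit.
    d_x = [alpha * sum((w * xi) ** 2 for w, xi in zip(row, x)) for row in theta_a]
    # Add the noise in R^r, then project up once with theta_B.
    latent = [m + math.sqrt(v) * e for m, v, e in zip(matvec(theta_a, x), d_x, eps)]
    return [b + u for b, u in zip(matvec(theta_0, x), matvec(theta_b, latent))]


def mc_predict(theta_0, theta_a, theta_b, x, alpha, n_samples, rng):
    """Monte Carlo mean and per-output standard deviation from n_samples draws."""
    r = len(theta_a)
    draws = [
        balora_sample(theta_0, theta_a, theta_b, x, alpha,
                      [rng.gauss(0.0, 1.0) for _ in range(r)])
        for _ in range(n_samples)
    ]
    means = [sum(col) / n_samples for col in zip(*draws)]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n_samples)
        for col, m in zip(zip(*draws), means)
    ]
    return means, stds


theta_0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # k=2, d=3 (toy sizes)
theta_a = [[1.0, 0.0, 2.0]]                    # r=1
theta_b = [[0.5], [-1.0]]
x = [1.0, 1.0, 1.0]
alpha = 0.25  # stand-in for alpha(x)

y_det = balora_sample(theta_0, theta_a, theta_b, x, alpha, eps=[0.0])
mean, std = mc_predict(theta_0, theta_a, theta_b, x, alpha, 2000, random.Random(0))
# An input with theta_A x = 0 and zero variance is unaffected by the noise draw.
y_null = balora_sample(theta_0, theta_a, theta_b, [0.0, 1.0, 0.0], alpha, eps=[3.0])
```

With `eps` set to zero the function reduces to the standard LoRA forward pass, while the Monte Carlo standard deviation approaches $|\theta_{B}|\sqrt{\mathbf{d}(\bm{x})}$ for this rank\-one toy case\.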
### 3\.4 Geometric Interpretation and Implicit Regularisation of BaLoRA
It is instructive to interpret BaLoRA geometrically\. Standard LoRA constrains the weight update to a low\-dimensional subspace of dimension $r \ll k$: the update $\bm{\theta}_B \bm{\theta}_A \bm{x}$ first projects the input onto the $r$\-dimensional row space of $\bm{\theta}_A$, and then lifts the result back to the $k$\-dimensional output space via $\bm{\theta}_B$\. The expressivity of the adaptation is therefore deliberately restricted to the column space of $\bm{\theta}_B$, denoted $\mathcal{V} = \mathrm{col}(\bm{\theta}_B) \subset \mathbb{R}^{k}$\.
BaLoRA preserves this structure while equipping it with an uncertainty estimate\. The noise term $\bm{\theta}_B \big(\sqrt{\mathbf{d}(\bm{x})} \odot \bm{\epsilon}_d\big)$ defines a random perturbation that, by construction, also lies in $\mathcal{V}$\. Geometrically, the predictive distribution can be understood as placing an *input\-adaptive ellipsoid* around the LoRA point estimate, confined entirely to the subspace $\mathcal{V}$:
$$\bm{y} = \underbrace{\bm{\theta}_0 \bm{x}}_{\text{pretrained}} + \underbrace{\bm{\theta}_B \bm{\theta}_A \bm{x}}_{\text{LoRA update}} + \underbrace{\bm{\theta}_B \big(\sqrt{\mathbf{d}(\bm{x})} \odot \bm{\epsilon}_d\big)}_{\text{uncertainty in } \mathcal{V}}. \tag{11}$$
The shape of this ellipsoid \(Figure [1](https://arxiv.org/html/2605.08110#S1.F1), *Right*\) is governed by the diagonal matrix $\mathrm{diag}(\mathbf{d}(\bm{x})) = \alpha(\bm{x})\, \mathrm{diag}(\bm{\theta}_A^{2} \bm{x}^{2})$, which has two geometrically meaningful properties\. First, the ellipsoid is *aligned with the LoRA basis*: its principal axes correspond to the columns of $\bm{\theta}_B$, so uncertainty is expressed in the same directions as the adaptation itself\. Second, the ellipsoid is *input\-adaptive*: its extent along each axis scales with $\mathbf{d}(\bm{x})$, meaning that inputs which produce large activations in the reduction matrix $\bm{\theta}_A$ are assigned wider uncertainty, while inputs that lie in the null space of $\bm{\theta}_A$ carry no uncertainty\.
We postulate that this geometric structure eases learning in two ways\. First, by confining stochastic perturbations to $\mathcal{V}$, noise is prevented from propagating into the pretrained representations outside the adapted subspace, preserving the stability of directions the model has not chosen to adapt\. Second, within $\mathcal{V}$, the input\-adaptive scaling has a multiplicative noise form, which induces larger perturbations precisely where activations are strongest, discouraging overconfident updates in the most active directions and therefore acting as an implicit regulariser\. Together, these effects allow $\alpha(\bm{x})$ to focus on identifying genuinely ambiguous inputs within the adapted subspace\.
## 4 Experiments
To evaluate BaLoRA, we design experiments spanning three domains: natural language, computer vision, and materials science\. We first demonstrate that introducing a Bayesian framework enhances LoRA’s optimization behavior under fixed hyperparameter settings, assessed on commonsense reasoning benchmarks\. On these tasks, we also compare BaLoRA against a range of Parameter\-Efficient Fine\-Tuning baselines using Llama\-2\-7B and Llama\-3\-8B, achieving leading performance across multiple settings, alongside an ablation study examining the role of the prior probability\. We subsequently broaden our scope to computer vision, evaluating BaLoRA against alternative PEFT methods when adapting a Vision Transformer to several image classification benchmarks\. We also examine BaLoRA’s uncertainty calibration and carry out a runtime and computational overhead comparison with other PEFT approaches\. Finally, we turn to materials science, applying BaLoRA to MOF property prediction\. Here, BaLoRA not only surpasses standard LoRA in predictive accuracy but also yields superior uncertainty estimates compared to a LoRA ensemble\. Experimental details and hyperparameters can be found in Appendix [B](https://arxiv.org/html/2605.08110#A2)\.
### 4\.1 Commonsense Reasoning
We benchmark against prompt\-based approaches, specifically Prompt Tuning \(Lester et al\., [2021](https://arxiv.org/html/2605.08110#bib.bib25)\) and P\-tuning \(Liu et al\., [2022](https://arxiv.org/html/2605.08110#bib.bib32)\), as well as low\-rank adaptation methods, namely LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.08110#bib.bib1)\), DoRA \(Liu et al\., [2024a](https://arxiv.org/html/2605.08110#bib.bib22)\), MoRA \(Jiang et al\., [2024](https://arxiv.org/html/2605.08110#bib.bib38)\), and HiRA \(Huang et al\., [2025](https://arxiv.org/html/2605.08110#bib.bib39)\)\. All experiments are conducted on two open\-source Large Language Models \(LLMs\): Llama\-2\-7B \(Touvron et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib20)\) and Llama\-3\-8B \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.08110#bib.bib30)\)\. The commonsense reasoning evaluation covers 8 sub\-tasks, each with its own predefined train/test split\. Unless noted otherwise, we adopt the experimental setup and hyperparameter configuration of Huang et al\. \([2025](https://arxiv.org/html/2605.08110#bib.bib39)\)\. We evaluate commonsense reasoning following the evaluation protocol of Hu et al\. \([2023](https://arxiv.org/html/2605.08110#bib.bib31)\) and Liu et al\. \([2024a](https://arxiv.org/html/2605.08110#bib.bib22)\), where accuracy is used as the performance metric\. For each query, the model’s response is determined by identifying the first occurrence of a task\-specific keyword \(e\.g\., “true” or “false” for BoolQ\); if no such keyword is found, the response is marked incorrect\.
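The keyword\-matching rule described above can be sketched as follows\. This is an illustrative reading, not the evaluation script of Hu et al\.: the function names `first_keyword` and `accuracy`, the case\-insensitive substring matching, and the example strings are assumptions\.

```python
def first_keyword(response, keywords):
    """Return the keyword that occurs earliest in the response, or None if absent."""
    text = response.lower()
    hits = [(text.find(k.lower()), k) for k in keywords]
    hits = [(pos, k) for pos, k in hits if pos >= 0]
    return min(hits)[1] if hits else None


def accuracy(responses, labels, keywords):
    """Fraction of responses whose first keyword matches the label.

    A response without any keyword yields None and therefore counts as incorrect.
    """
    correct = sum(first_keyword(r, keywords) == y for r, y in zip(responses, labels))
    return correct / len(labels)


boolq_keywords = ["true", "false"]
responses = ["The answer is True, not false.", "false", "I am not sure."]
labels = ["true", "false", "true"]
score = accuracy(responses, labels, boolq_keywords)
```

Plain substring matching would also fire on words such as "untrue", so a production script would likely need word\-boundary handling; that refinement is omitted here\.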
#### BaLoRA Training Dynamics Improve LoRA Performance
To isolate the effect of our Bayesian components, we conduct a controlled comparison between LoRA and BaLoRA under identical settings\. We use the same weight initialization, dropout probability, and hyperparameters, so the two methods differ only in the stochastic adaptive noise injection and KL regularization introduced by BaLoRA\. As shown in Table [1](https://arxiv.org/html/2605.08110#S4.T1), BaLoRA consistently outperforms LoRA by an average margin of $\sim 2.9\%$ on Llama\-2 and $\sim 5\%$ on Llama\-3, supporting our hypothesis that exploiting geometric structure facilitates optimization\.
Table 1: Accuracy comparison between standard LoRA and BaLoRA using equivalent dropout coefficients\. For LoRA, dropout is applied directly to adapter weights; for BaLoRA, the dropout coefficient serves as $p$ in the KL divergence loss formulation \(equation \([6](https://arxiv.org/html/2605.08110#S3.E6)\)\)\. Blue indicates percentage improvement over LoRA, red indicates percentage decrease\.
#### BaLoRA Achieves State\-of\-the\-Art on Commonsense Reasoning
Table [2](https://arxiv.org/html/2605.08110#S4.T2) summarizes accuracy across eight commonsense reasoning benchmarks\. BaLoRA ranks first on five out of eight tasks for Llama\-2\-7B and six out of eight tasks for Llama\-3\-8B, while remaining within a comparable parameter budget to all low\-rank baselines\. The additional trainable parameter count \(*Params* in Table [2](https://arxiv.org/html/2605.08110#S4.T2)\) is attributed to the inference network predicting $\alpha$, and is negligible compared to the baseline models\. Against the closest competitor, HiRA, BaLoRA delivers steady gains across the majority of sub\-tasks, suggesting that the Bayesian geometric prior provides an advantage over purely deterministic rank decomposition\. Taken together, these results establish BaLoRA as a robust and parameter\-efficient alternative to existing PEFT methods\. To the best of our knowledge, BaLoRA is the first PEFT method whose improvements are driven by principled uncertainty modeling rather than increased model capacity\.
Table 2: Accuracy comparison among various PEFT methods on commonsense reasoning datasets\. Results for ChatGPT, LoRA, and DoRA are sourced from Huang et al\. \([2025](https://arxiv.org/html/2605.08110#bib.bib39)\)\. The best performance within each LLM is indicated in bold, while the second best performance is underlined\.
#### BaLoRA is Robust across Prior Probabilities
In Figure [2](https://arxiv.org/html/2605.08110#S4.F2) we examine the sensitivity of BaLoRA to the choice of prior probability $p$ in the KL divergence loss of equation \([6](https://arxiv.org/html/2605.08110#S3.E6)\) for Llama\-3 across the BoolQ, PIQA, SIQA and WinoG datasets \(see Appendix [B](https://arxiv.org/html/2605.08110#A2) for Llama\-2 and additional datasets\)\. Across both Llama\-2 and Llama\-3, performance remains largely stable over the full range of tested values, with larger values of $p$ showing marginal accuracy gains across multiple sub\-tasks\. The spread in average accuracy across values of $p$ is narrow, indicating that BaLoRA is robust to this hyperparameter\.
Figure 2: Ablation study for Llama\-3 on the effect of prior probability $p$ in the KL divergence loss formulation \(equation \([6](https://arxiv.org/html/2605.08110#S3.E6)\)\) across benchmark tasks\. Results demonstrate robustness across different values of $p$\.
### 4\.2 Visual Perception
Having established BaLoRA’s advantage on language tasks, we turn to visual perception to probe the generality of our approach across domains\. We benchmark BaLoRA against LoRA and DoRA, the two most established low\-rank adaptation methods in the vision literature, by fine\-tuning a ViT\-L/16 pretrained on ImageNet\-21K \(Dosovitskiy et al\., [2020](https://arxiv.org/html/2605.08110#bib.bib36)\) across four image classification benchmarks: CIFAR\-10, CIFAR\-100, Oxford\-IIIT Pets, and Oxford Flowers\-102\. We follow the standard experimental protocol of Dosovitskiy et al\. \([2020](https://arxiv.org/html/2605.08110#bib.bib36)\) with minor learning rate adjustments, applying adapters exclusively to the query and value projection matrices as is conventional in the `timm` library\. We set $r = 8$, $\alpha = 16$ for CIFAR datasets and $r = 16$, $\alpha = 32$ otherwise; full hyperparameter details are provided in Appendix [C](https://arxiv.org/html/2605.08110#A3)\. Beyond classification accuracy, we additionally report Expected Calibration Error \(ECE\), a principled measure of predictive reliability, to assess whether BaLoRA’s stochastic training objective yields more calibrated models than its deterministic counterparts\.
#### BaLoRA is Accurate and Calibrated on Image Classification
Table 3: Accuracy comparison among various PEFT methods on classification datasets for ViT\-L/16 \(ImageNet12K\)\. Results for full fine\-tuning are used for reference and are sourced from Dosovitskiy et al\. \([2020](https://arxiv.org/html/2605.08110#bib.bib36)\)\. The best performance within each ViT is indicated in bold, while the second best performance is underlined\.
Table 4: Expected calibration error comparison among various PEFT methods on classification datasets for ViT\-L/16 \(ImageNet12K\)\. The best performance within each ViT is indicated in bold, while the second best performance is underlined\. Performance is expressed as a percentage\.
We report classification accuracy across four benchmarks in Table [3](https://arxiv.org/html/2605.08110#S4.T3)\. BaLoRA consistently ranks first among all PEFT methods, matching or approaching full fine\-tuning performance\. Most strikingly, BaLoRA achieves state\-of\-the\-art performance on CIFAR100, surpassing even fully fine\-tuned models, and it closes the gap entirely on Oxford Flowers\-102, matching the full fine\-tuning ceiling of $99.61\%$ accuracy\. Against the strongest PEFT baseline, DoRA, BaLoRA delivers consistent improvements across all four datasets\. These results demonstrate that BaLoRA’s Bayesian inductive bias is not specific to the language domain but transfers naturally to visual perception, providing a robust and parameter\-efficient alternative to deterministic low\-rank adaptation\.
Beyond predictive accuracy, a well\-calibrated model is essential for trustworthy deployment in real\-world settings\. Table[4](https://arxiv.org/html/2605.08110#S4.T4)reports the Expected Calibration Error across the four benchmarks\. BaLoRA achieves the lowest ECE on two out of four datasets and ranks second on the remaining two, consistently outperforming or matching both LoRA and DoRA without any post\-hoc calibration procedure\. This suggests that the stochastic adaptive injection and KL regularization jointly act as an implicit regularizer, discouraging overconfident predictions\.
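For readers unfamiliar with the metric, the following is a minimal sketch of how ECE is commonly computed\. It assumes equal\-width confidence bins and ten bins; neither choice is stated in this section, and the function name is illustrative rather than taken from the authors' code\.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (binning scheme is an assumption).

    confidences: top-class predicted probability per sample, each in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; conf == 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data: an overconfident bin (0.95 confidence, 75% correct) and a
# nearly calibrated one (0.55 confidence, 50% correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

A lower value means that the stated confidence tracks the empirical accuracy more closely, which is the sense in which Table 4 compares methods\.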
#### BaLoRA adds minimal Computational Cost over LoRA
Table [5](https://arxiv.org/html/2605.08110#S4.T5) compares memory and compute overhead across methods during training\. \(We do not report inference statistics because all models, BaLoRA included, can merge adapter weights into the pre\-trained weights, resulting in zero additional latency or memory cost\.\) BaLoRA introduces a modest increase in trainable parameters \(1\.14M vs\. 0\.80M for LoRA\) and a $1.29\times$ FLOPs overhead relative to full fine\-tuning, a direct consequence of the stochastic sampling during the forward pass\. Peak memory consumption remains well below full fine\-tuning and DoRA, confirming that BaLoRA operates comfortably within the parameter\-efficient regime despite its Bayesian formulation\.
Table 5: Resource comparison during training among various PEFT methods on the CIFAR\-100 dataset for ViT\-L/16 \(ImageNet12K\)\. Measurements are taken on a single 94GB H100 for one image input\.
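As a rough sanity check on the LoRA parameter count, the sketch below counts adapter parameters under stated assumptions: the standard ViT\-L/16 configuration \(24 transformer blocks, hidden width 1024\), adapters on the query and value projections only, and $r=8$ as used for the CIFAR datasets\. The helper name is hypothetical, and the BaLoRA count is not reproduced because the size of its noise network is not given in this section\.

```python
def lora_param_count(d_in, d_out, rank, n_layers, adapters_per_layer):
    """Trainable parameters of plain LoRA adapters (hypothetical helper).

    Each adapter holds a rank x d_in reduction matrix and a
    d_out x rank expansion matrix.
    """
    per_adapter = rank * (d_in + d_out)
    return per_adapter * n_layers * adapters_per_layer


# Assumed ViT-L/16 shape: 24 blocks, width 1024, adapters on q and v.
n_params = lora_param_count(
    d_in=1024, d_out=1024, rank=8, n_layers=24, adapters_per_layer=2
)
```

Under these assumptions the count is 786,432, roughly 0\.79M, which is in the neighbourhood of the 0\.80M reported for LoRA; the small remainder is not accounted for by this sketch\.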
### 4\.3 Materials Property Prediction
After evaluating BaLoRA in language and vision domains, we extend to a scientific setting to assess predictive accuracy and uncertainty quantification, a feature that deterministic low\-rank methods lack in regression\-based settings\. We fine\-tune MOFTransformer \(Kang et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib17)\) on the QMOF dataset \(Rosen et al\., [2021](https://arxiv.org/html/2605.08110#bib.bib34)\) for bandgap prediction, an important task in designing MOFs for energy storage applications \(Rosen et al\., [2022](https://arxiv.org/html/2605.08110#bib.bib33)\)\. Based on preliminary results, we apply adapters only to MLP layers with $r=64$ and $\alpha=128$\. We compare BaLoRA to an ensemble of LoRA models, a strong UQ baseline requiring multiple adapters, while BaLoRA provides calibrated uncertainty from a single adapter\. Importantly, training multiple adapters requires multiple fine\-tuning runs, while BaLoRA needs only one\. We evaluate accuracy using Mean Absolute Error \(MAE\) and uncertainty quality via the Spearman correlation between predicted uncertainty and actual error\. Full hyperparameter details are provided in Appendix [B](https://arxiv.org/html/2605.08110#A2)\.
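The two evaluation metrics can be sketched in a few lines\. The code below is an illustrative, dependency\-free implementation \(function names and toy numbers are ours, not the authors'\): MAE over predictions, and the Spearman correlation computed as the Pearson correlation of average ranks between predicted uncertainty and absolute error\.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy bandgaps in eV with a per-sample predicted standard deviation.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.7, 3.4, 3.0]
pred_std = [0.05, 0.2, 0.3, 0.9]

abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = mean_absolute_error(y_true, y_pred)
rho = spearman(pred_std, abs_err)
```

In this toy case the uncertainty ranks the errors perfectly, so the correlation is one; real values are much lower, which is why a rank\-based measure is informative here: it rewards uncertainty that orders samples by error without requiring the magnitudes to match\.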
#### BaLoRA is Accurate and Calibrated on Bandgap Prediction
Figure [5](https://arxiv.org/html/2605.08110#A3.F5) reports MAE and Spearman correlation on the QMOF benchmark\. BaLoRA consistently outperforms DoRA and LoRA across both metrics\. In terms of accuracy, BaLoRA achieves an MAE of $0.266$ eV, narrowing the gap to full fine\-tuning \($0.224$ eV\) while remaining significantly more parameter\-efficient \(using only $6.75\%$ of trainable parameters\)\. Notably, its predictions more closely match ground truth values in the low bandgap regime \($<2$ eV\), which is particularly important since lower bandgaps correspond to higher conductivity and are critical for applications such as photocatalysis \(Cao et al\., [2023](https://arxiv.org/html/2605.08110#bib.bib35)\)\. For uncertainty quantification, BaLoRA achieves the highest Spearman correlation \($0.346$\) between predicted uncertainty and actual error, outperforming ensemble DoRA \($0.259$\) and LoRA \($0.309$\)\. Interestingly, we find that multiple forward passes at test time further improve uncertainty estimates without requiring additional training \(Figure [5](https://arxiv.org/html/2605.08110#A3.F5)\)\. However, all methods exhibit only moderate correlations, underscoring the intrinsic difficulty of reliable uncertainty estimation in this task\. This indicates that while BaLoRA improves calibration, bandgap prediction for MOFs remains a challenging setting for uncertainty\-aware modeling\.
Figure 3: Performance in eV among various PEFT methods on bandgap prediction using MOFTransformer on the QMOF dataset\.
## 5 Conclusions
We presented BaLoRA, a Bayesian extension of Low\-Rank Adaptation that addresses two fundamental limitations of existing PEFT methods: the expressiveness gap relative to full fine\-tuning, and the absence of principled uncertainty quantification\. By introducing an input\-adaptive Bayesian reparametrization of the LoRA reduction matrix, BaLoRA encodes uncertainty during training via adaptive noise injection, which simultaneously acts as an implicit regularizer and enables calibrated predictions at test time\. A low\-rank local reparametrization trick ensures that the posterior distribution is sampled exactly in the low\-dimensional latent space, matching the computational scaling of standard LoRA and incurring no additional inference overhead in deterministic mode\. Across commonsense reasoning, visual perception, and materials property prediction, BaLoRA consistently outperforms LoRA and its deterministic variants while remaining within a comparable parameter and FLOPs budget\. On commonsense reasoning, BaLoRA achieves state\-of\-the\-art accuracy on the majority of sub\-tasks for both Llama\-2\-7B and Llama\-3\-8B, driven by principled uncertainty training rather than increased model capacity\. On image classification, it matches full fine\-tuning accuracy on Oxford Flowers\-102 and achieves strong calibration across all benchmarks without post\-hoc correction\. Finally, on bandgap prediction for metal\-organic frameworks, BaLoRA produces zero\-shot uncertainty estimates that correlate more strongly with model error than a trained ensemble of LoRA models, while requiring only a single fine\-tuning run\. These results suggest that Bayesian inductive biases, when carefully integrated into the geometry of low\-rank adapters, offer complementary and additive advantages over purely deterministic approaches\. 
We hope BaLoRA encourages broader adoption of uncertainty\-aware PEFT in scientific and safety\-critical applications, where reliable predictions matter as much as accurate ones\.
## References
- Intrinsic dimensionality explains the effectiveness of language model fine\-tuning\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), pp\. 7319–7328\.
- T\. Bayes \(1763\) An Essay towards solving a Problem in the Doctrine of Chances\. By the late Rev\. Mr\. Bayes, FRS communicated by Mr\. Price, in a letter to John Canton, A\.M\.F\.R\.S\. Philosophical Transactions of the Royal Society of London\.
- C\. Blundell, J\. Cornebise, K\. Kavukcuoglu, and D\. Wierstra \(2015\) Weight Uncertainty in Neural Networks\. In International Conference on Machine Learning, pp\. 1613–1622\.
- C\. Bodnar, W\. P\. Bruinsma, A\. Lucic, M\. Stanley, A\. Allen, J\. Brandstetter, P\. Garvan, M\. Riechert, J\. A\. Weyn, H\. Dong, et al\. \(2025\) A foundation model for the earth system\. Nature 641\(8065\), pp\. 1180–1187\.
- Z\. Cao, R\. Magar, Y\. Wang, and A\. Barati Farimani \(2023\) Moformer: self\-supervised transformer model for metal–organic framework property prediction\. Journal of the American Chemical Society 145\(5\), pp\. 2958–2967\.
- D\. Coscia, P\. de Haan, and M\. Welling \(2025a\) BLIPs: bayesian learned interatomic potentials\. arXiv preprint arXiv:2508\.14022\.
- D\. Coscia, M\. Welling, N\. Demo, and G\. Rozza \(2025b\) BARNN: a bayesian autoregressive and recurrent neural network\. In Forty\-second International Conference on Machine Learning\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, et al\. \(2020\) An image is worth 16x16 words: transformers for image recognition at scale\. arXiv preprint arXiv:2010\.11929\.
- Y\. Gal and Z\. Ghahramani \(2016\) Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning\. In International Conference on Machine Learning, pp\. 1050–1059\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- A\. Graves \(2011\) Practical Variational Inference for Neural Networks\. Advances in Neural Information Processing Systems 24\.
- G\. E\. Hinton and D\. Van Camp \(1993\) Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights\. In Conference on Computational Learning Theory\.
- N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly \(2019\) Parameter\-efficient transfer learning for nlp\. In International Conference on Machine Learning, pp\. 2790–2799\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9)\.
- Z\. Hu, L\. Wang, Y\. Lan, W\. Xu, E\. Lim, L\. Bing, X\. Xu, S\. Poria, and R\. Lee \(2023\) Llm\-adapters: an adapter family for parameter\-efficient fine\-tuning of large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 5254–5276\.
- Q\. Huang, T\. Ko, Z\. Zhuang, L\. Tang, and Y\. Zhang \(2025\) HiRA: parameter\-efficient hadamard high\-rank adaptation for large language models\. In The Thirteenth International Conference on Learning Representations\.
- T\. Jiang, S\. Huang, S\. Luo, Z\. Zhang, H\. Huang, F\. Wei, W\. Deng, F\. Sun, Q\. Zhang, D\. Wang, et al\. \(2024\) Mora: high\-rank updating for parameter\-efficient fine\-tuning\. arXiv preprint arXiv:2405\.12130\.
- Y\. Kang, H\. Park, B\. Smit, and J\. Kim \(2023\) A multi\-modal pre\-training transformer for universal transfer learning in metal–organic frameworks\. Nature Machine Intelligence 5\(3\), pp\. 309–318\.
- D\. P\. Kingma, T\. Salimans, and M\. Welling \(2015\) Variational Dropout and the Local Reparameterization Trick\. In Advances in Neural Information Processing Systems\.
- B\. Lakshminarayanan, A\. Pritzel, and C\. Blundell \(2017\) Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles\. Advances in Neural Information Processing Systems 30\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\) The power of scale for parameter\-efficient prompt tuning\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 3045–3059\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\) Visual instruction tuning\. International Conference on Neural Information Processing Systems 36, pp\. 34892–34916\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024a\) Dora: weight\-decomposed low\-rank adaptation\. In Forty\-first International Conference on Machine Learning\.
- X\. Liu, K\. Ji, Y\. Fu, W\. Tam, Z\. Du, Z\. Yang, and J\. Tang \(2022\) P\-tuning: prompt tuning can be comparable to fine\-tuning across scales and tasks\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\), pp\. 61–68\.
- X\. Liu, Y\. Zheng, Z\. Du, M\. Ding, Y\. Qian, Z\. Yang, and J\. Tang \(2024b\) GPT understands, too\. AI Open 5, pp\. 208–215\.
- I\. Loshchilov and F\. Hutter \(2017\) Decoupled weight decay regularization\. arXiv preprint arXiv:1711\.05101\.
- R\. K\. Mahabadi, S\. Ruder, M\. Dehghani, and J\. Henderson \(2021\) Parameter\-efficient multi\-task fine\-tuning for transformers via shared hypernetworks\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), pp\. 565–576\.
- C\. Qin, A\. Zhang, Z\. Zhang, J\. Chen, M\. Yasunaga, and D\. Yang \(2023\) Is chatgpt a general\-purpose natural language processing task solver?\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 1339–1384\.
- A\. Razdaibiedina, Y\. Mao, M\. Khabsa, M\. Lewis, R\. Hou, J\. Ba, and A\. Almahairi \(2023\) Residual prompt tuning: improving prompt tuning with residual reparameterization\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 6740–6757\.
- A\. S\. Rosen, V\. Fung, P\. Huck, C\. T\. O’Donnell, M\. K\. Horton, D\. G\. Truhlar, K\. A\. Persson, J\. M\. Notestein, and R\. Q\. Snurr \(2022\) High\-throughput predictions of metal–organic framework electronic properties: theoretical challenges, graph neural networks, and data exploration\. npj Computational Materials 8\(1\), pp\. 112\.
- A\. S\. Rosen, S\. M\. Iyer, D\. Ray, Z\. Yao, A\. Aspuru\-Guzik, L\. Gagliardi, J\. M\. Notestein, and R\. Q\. Snurr \(2021\) Machine learning the quantum\-chemical properties of metal–organic frameworks for accelerated materials discovery\. Matter 4\(5\), pp\. 1578–1597\.
- N\. Shoghi, A\. Kolluru, J\. R\. Kitchin, Z\. W\. Ulissi, C\. L\. Zitnick, and B\. M\. Wood \(2024\) From molecules to materials: pre\-training large generalizable models for atomic property prediction\. In International Conference on Learning Representations\.
- H\. Touvron, M\. Cord, M\. Douze, F\. Massa, A\. Sablayrolles, and H\. Jégou \(2021\) Training data\-efficient image transformers & distillation through attention\. In International Conference on Machine Learning, pp\. 10347–10357\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, et al\. \(2023\) Llama: open and efficient foundation language models\. arXiv preprint arXiv:2302\.13971\.
- X\. Wang, L\. Aitchison, and M\. Rudolph \(2023\) LoRA ensembles for large language model fine\-tuning\. arXiv preprint arXiv:2310\.00035\.
- M\. Welling and Y\. W\. Teh \(2011\) Bayesian Learning via Stochastic Gradient Langevin Dynamics\. In Proceedings of the 28th International Conference on Machine Learning \(ICML\-11\), pp\. 681–688\.
- B\. M\. Wood, M\. Dzamba, X\. Fu, M\. Gao, M\. Shuaibi, L\. Barroso\-Luque, K\. Abdelmaqsoud, V\. Gharakhanyan, J\. R\. Kitchin, D\. S\. Levine, et al\. \(2025\) UMA: a family of universal models for atoms\. In International Conference on Neural Information Processing Systems\.
- T\. Xie and J\. C\. Grossman \(2018\) Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties\. Physical Review Letters 120\(14\), pp\. 145301\.
## Appendix A Proofs and Derivations
This appendix provides the formal proofs of the results introduced in the main text\.
### A\.1 Predictive Distribution and Uncertainty Estimates
We begin by deriving the posterior predictive distribution of the output induced by the stochastic LoRA update \(Figure [1](https://arxiv.org/html/2605.08110#S1.F1)\)\.
###### Proposition 1 \(Predictive Distribution\)\.
Under the posterior

$$q(\bm{\omega}_{A}\mid\bm{x})=\prod_{i=1}^{r}\prod_{j=1}^{d}\mathcal{N}(\theta_{A;ij},\alpha(\bm{x})\theta_{A;ij}^{2}),\tag{12}$$

and with deterministic $\bm{\theta}_{B}$, the output

$$\bm{y}=\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\omega}_{A}\bm{x}\tag{13}$$

is Gaussian with

$$\bm{y}\sim\mathcal{N}\Big(\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\theta}_{A}\bm{x},\;\alpha(\bm{x})\,\bm{\theta}_{B}\,\mathrm{diag}(\bm{\theta}_{A}^{2}\bm{x}^{2})\,\bm{\theta}_{B}^{\top}\Big).\tag{14}$$
###### Proof\.
Let $\bm{z}=\bm{\omega}_{A}\bm{x}\in\mathbb{R}^{r}$, whose components are $z_{i}=\sum_{j=1}^{d}\omega_{A;ij}\,x_{j}$\. Since the $\omega_{A;ij}$ are independent Gaussian random variables, each $z_{i}$ is Gaussian, and thus $\bm{z}$ is Gaussian\. The moments are computed as follows\.

Mean\. Using $\mathbb{E}[\omega_{A;ij}]=\theta_{A;ij}$,

$$\mathbb{E}[z_{i}]=\sum_{j=1}^{d}\mathbb{E}[\omega_{A;ij}]\,x_{j}=\sum_{j=1}^{d}\theta_{A;ij}\,x_{j}=(\bm{\theta}_{A}\bm{x})_{i}.\tag{15}$$

Hence,

$$\mathbb{E}[\bm{z}]=\bm{\theta}_{A}\bm{x}.\tag{16}$$

Covariance\. For $i\neq k$, independence across rows implies

$$\mathrm{Cov}(z_{i},z_{k})=0.\tag{17}$$

For the variance of each component,

$$\mathrm{Var}(z_{i})=\sum_{j=1}^{d}\mathrm{Var}(\omega_{A;ij})\,x_{j}^{2}=\sum_{j=1}^{d}\alpha(\bm{x})\theta_{A;ij}^{2}\,x_{j}^{2}.\tag{18}$$

Thus,

$$\mathrm{Cov}(\bm{z})=\alpha(\bm{x})\,\mathrm{diag}\!\Big(\sum_{j=1}^{d}\theta_{A;ij}^{2}x_{j}^{2}\Big)_{i=1}^{r}=\alpha(\bm{x})\,\mathrm{diag}(\bm{\theta}_{A}^{2}\bm{x}^{2}).\tag{19}$$

Distribution of $\bm{y}$\. Finally,

$$\bm{y}=\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{z},\tag{20}$$

which is an affine transformation of a Gaussian vector\. Therefore,

$$\mathbb{E}[\bm{y}]=\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\theta}_{A}\bm{x},\tag{21}$$

$$\mathrm{Cov}(\bm{y})=\bm{\theta}_{B}\,\mathrm{Cov}(\bm{z})\,\bm{\theta}_{B}^{\top}.\tag{22}$$

Substituting $\mathrm{Cov}(\bm{z})$ yields

$$\bm{y}\sim\mathcal{N}\Big(\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\theta}_{A}\bm{x},\;\alpha(\bm{x})\,\bm{\theta}_{B}\,\mathrm{diag}(\bm{\theta}_{A}^{2}\bm{x}^{2})\,\bm{\theta}_{B}^{\top}\Big).\tag{23}$$

∎
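The covariance in Proposition 1 can be checked numerically\. The following sketch, written with NumPy and using arbitrary small shapes and a fixed scalar standing in for $\alpha(\bm{x})$ \(both our assumptions\), samples every entry of $\bm{\omega}_{A}$ independently and compares the empirical covariance of the stochastic part of $\bm{y}$ with the closed form, where $\bm{\theta}_{A}^{2}$ and $\bm{x}^{2}$ are elementwise squares\.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d, m = 2, 3, 4                      # rank, input dim, output dim (toy sizes)
theta_A = rng.normal(size=(r, d))      # posterior mean of the reduction matrix
theta_B = rng.normal(size=(m, r))      # deterministic expansion matrix
x = rng.normal(size=d)
alpha = 0.5                            # stands in for alpha(x) at this input

# Closed form: Cov(z) = alpha * diag(theta_A^2 x^2), Cov(y) = B Cov(z) B^T.
var_z = alpha * (theta_A**2) @ (x**2)
cov_y = theta_B @ np.diag(var_z) @ theta_B.T

# Monte Carlo: omega_ij ~ N(theta_ij, alpha * theta_ij^2), all independent.
n_samples = 200_000
noise = rng.normal(size=(n_samples, r, d))
omega_A = theta_A + np.sqrt(alpha) * np.abs(theta_A) * noise
z = omega_A @ x                        # shape (n_samples, r)
delta_y = z @ theta_B.T                # theta_0 x is a constant shift, omitted
emp_cov = np.cov(delta_y, rowvar=False)
```

The empirical and analytic covariances agree up to Monte Carlo error\. Note that this direct approach draws $rd$ random numbers per sample, which is the cost the low\-rank procedure below avoids\.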
Given the Gaussian posterior predictive derived above, the remaining challenge is efficient sampling\. Direct approaches require constructing and factorizing the full output covariance, which scales quadratically in the output dimension and is prohibitive in practice, as explained in the main text\. We next show how the low\-rank structure of the LoRA parameterization can be exploited to obtain an exact but computationally efficient sampling procedure\.
###### Proposition 2 \(Low\-Rank Sampling\)\.
Let the posterior predictive distribution of $\bm{y}$ be

$$\bm{y}\sim\mathcal{N}\Big(\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\theta}_{A}\bm{x},\;\Sigma\Big),\tag{24}$$

with covariance

$$\Sigma=\alpha(\bm{x})\,\bm{\theta}_{B}\,\mathrm{diag}(\bm{\theta}_{A}^{2}\bm{x}^{2})\,\bm{\theta}_{B}^{\top}.\tag{25}$$

Define $\mathbf{d}(\bm{x})=\alpha(\bm{x})\,\bm{\theta}_{A}^{2}\bm{x}^{2}\in\mathbb{R}^{r}$\. Then an exact sample from this distribution can be obtained as

$$\bm{y}=\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\theta}_{A}\bm{x}+\bm{\theta}_{B}\big(\sqrt{\mathbf{d}(\bm{x})}\odot\bm{\epsilon}_{d}\big),\quad\bm{\epsilon}_{d}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{r}).\tag{26}$$
###### Proof\.
Let $\bm{\epsilon}_{d}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{r})$ and define

$$\tilde{\bm{z}}=\bm{\theta}_{A}\bm{x}+\sqrt{\mathbf{d}(\bm{x})}\odot\bm{\epsilon}_{d}.\tag{27}$$

Since $\bm{\epsilon}_{d}$ is standard Gaussian, $\tilde{\bm{z}}$ is Gaussian with

$$\mathbb{E}[\tilde{\bm{z}}]=\bm{\theta}_{A}\bm{x},\tag{28}$$

$$\mathrm{Cov}(\tilde{\bm{z}})=\mathrm{diag}(\mathbf{d}(\bm{x})).\tag{29}$$

Now define

$$\bm{y}=\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\tilde{\bm{z}}.\tag{30}$$

Since $\bm{y}$ is an affine transformation of a Gaussian vector, it is Gaussian with mean

$$\mathbb{E}[\bm{y}]=\bm{\theta}_{0}\bm{x}+\bm{\theta}_{B}\bm{\theta}_{A}\bm{x},\tag{31}$$

and covariance

$$\mathrm{Cov}(\bm{y})=\bm{\theta}_{B}\,\mathrm{Cov}(\tilde{\bm{z}})\,\bm{\theta}_{B}^{\top}=\bm{\theta}_{B}\,\mathrm{diag}(\mathbf{d}(\bm{x}))\,\bm{\theta}_{B}^{\top}.\tag{32}$$

Substituting $\mathbf{d}(\bm{x})=\alpha(\bm{x})\,\bm{\theta}_{A}^{2}\bm{x}^{2}$ yields

$$\mathrm{Cov}(\bm{y})=\alpha(\bm{x})\,\bm{\theta}_{B}\,\mathrm{diag}(\bm{\theta}_{A}^{2}\bm{x}^{2})\,\bm{\theta}_{B}^{\top}=\Sigma.\tag{33}$$

Therefore, $\bm{y}$ has the desired distribution, and the proposed sampling procedure is exact\. ∎
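A minimal NumPy sketch of the low\-rank sampling procedure is given below\. The function name, the toy shapes, and the use of a fixed scalar for $\alpha(\bm{x})$ are assumptions made for illustration; this is not the authors' implementation\. The point it illustrates is that only $r$ Gaussian draws per sample are needed, and the $m\times m$ covariance $\Sigma$ is never formed or factorized during sampling\.

```python
import numpy as np


def low_rank_sample(theta_0, theta_B, theta_A, x, alpha, rng, n_samples=1):
    """Sample y = theta_0 x + theta_B z~, with noise injected in R^r.

    Hypothetical helper: alpha is the scalar alpha(x) for this input.
    Returns an array of shape (n_samples, m).
    """
    mean_z = theta_A @ x                          # (r,)  posterior-mean latent
    d_x = alpha * (theta_A**2) @ (x**2)           # (r,)  d(x), elementwise squares
    eps = rng.standard_normal((n_samples, mean_z.shape[0]))
    z = mean_z + np.sqrt(d_x) * eps               # (n_samples, r)
    return theta_0 @ x + z @ theta_B.T            # broadcast base output


rng = np.random.default_rng(0)
m, d, r = 4, 3, 2
theta_0 = rng.normal(size=(m, d))
theta_B = rng.normal(size=(m, r))
theta_A = rng.normal(size=(r, d))
x = rng.normal(size=d)
alpha = 0.5

samples = low_rank_sample(theta_0, theta_B, theta_A, x, alpha, rng, n_samples=200_000)

# Reference moments from the proposition, formed here only to check the sampler.
mean_y = theta_0 @ x + theta_B @ theta_A @ x
Sigma = alpha * theta_B @ np.diag((theta_A**2) @ (x**2)) @ theta_B.T
emp_cov = np.cov(samples, rowvar=False)
```

The empirical mean and covariance of `samples` match `mean_y` and `Sigma` up to Monte Carlo error, consistent with the exactness claim of Proposition 2\.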
Having derived an exact posterior predictive distribution and an efficient sampling procedure, we now consider two complementary inference modes derived from the same stochastic LoRA formulation\. The first is a *deterministic mode*, which replaces the stochastic reduction matrix with its posterior mean\. This corresponds to a maximum a posteriori \(MAP\)\-style approximation of the model and enables fast inference, as the resulting LoRA weights can be directly merged into the base model without requiring sampling\. The second is a *sampling mode*, where stochastic forward passes are performed using the posterior predictive distribution, enabling the estimation of predictive uncertainty via Monte Carlo sampling\.
#### Deterministic mode\.
In the deterministic setting, we replace the stochastic LoRA parameters with their posterior means\. Specifically, for each LoRA layer we obtain the adapted weights

$$\bm{\theta}_{\text{adapted}}=\bm{\theta}_{0}+\bm{\theta}_{B}\bm{\theta}_{A}.\tag{34}$$

This recovers the standard LoRA formulation, where the low\-rank update $\bm{\theta}_{B}\bm{\theta}_{A}$ is deterministically added to the base weights $\bm{\theta}_{0}$\. Importantly, this form allows the adapted weights to be precomputed and merged into the base model prior to inference\. It does not require the inference network $\alpha$, and thus results in no additional computational overhead at test time\.
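The merge can be illustrated with a short NumPy sketch on toy shapes \(our choice\): applying the merged matrix gives the same output as applying the base weights and the low\-rank branch separately, so the adapter adds no extra matrix at inference\.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, r = 4, 3, 2
theta_0 = rng.normal(size=(m, d))   # frozen base weights
theta_B = rng.normal(size=(m, r))   # expansion matrix
theta_A = rng.normal(size=(r, d))   # posterior mean of the reduction matrix
x = rng.normal(size=d)

# Merge once, before inference.
theta_adapted = theta_0 + theta_B @ theta_A

y_merged = theta_adapted @ x
y_unmerged = theta_0 @ x + theta_B @ (theta_A @ x)
```

Both outputs coincide up to floating\-point rounding, and `theta_adapted` has the same shape as `theta_0`\.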
#### Sampling mode\.
In the sampling setting, we retain the stochasticity of the reduction matrix\. For each LoRA layer, we instead use

$$\bm{\omega}_{\text{adapted}}=\bm{\theta}_{0}+\bm{\theta}_{B}\bm{\omega}_{A},\tag{35}$$

where the output $\bm{\omega}_{\text{adapted}}\bm{x}$ is sampled \(using Low\-Rank Sampling\) from the learned predictive distribution\. Performing multiple stochastic forward passes $s=1,\dots,S$ yields a set of predictive outputs from which we can estimate the predictive mean and variance via Monte Carlo estimation:

$$\hat{\mathbb{E}}[\bm{y}]=\frac{1}{S}\sum_{s=1}^{S}\bm{y}^{(s)},\qquad\hat{\bm{\sigma}}^{2}=\frac{1}{S}\sum_{s=1}^{S}\big(\bm{y}^{(s)}-\hat{\mathbb{E}}[\bm{y}]\big)^{2}.\tag{36}$$
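The Monte Carlo estimators above translate directly into code\. The sketch below uses a hypothetical helper name and hand\-picked toy outputs; it stacks $S$ stochastic forward passes along the first axis and applies the $1/S$ normalisation used in the estimator for the variance\.

```python
import numpy as np


def mc_mean_and_variance(samples):
    """Predictive mean and per-dimension variance over S stochastic passes.

    samples: array-like of shape (S, ...), one row per forward pass.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    var = ((samples - mean) ** 2).mean(axis=0)   # 1/S, not 1/(S-1)
    return mean, var


# Three toy forward passes of a two-dimensional output.
outputs = [[1.0, 0.0], [3.0, 0.0], [2.0, 3.0]]
mean, var = mc_mean_and_variance(outputs)
```

For these toy outputs the mean is $(2, 1)$ and the variance is $(2/3, 2)$\.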
Moreover, the predictive uncertainty can be further decomposed into epistemic and aleatoric components via the law of total variance:

$$\mathrm{Var}(\bm{y}\mid\bm{x})=\underbrace{\mathrm{Var}_{\bm{\omega}}\!\left(\mathbb{E}[\bm{y}\mid\bm{x},\bm{\omega}]\right)}_{\text{epistemic}}+\underbrace{\mathbb{E}_{\bm{\omega}}\!\left[\mathrm{Var}(\bm{y}\mid\bm{x},\bm{\omega})\right]}_{\text{aleatoric}}.\tag{37}$$

In general, $\bm{x}$ denotes the model input \(e\.g\., a token sequence for language models or a feature vector in classification settings\), processed by a neural backbone\. Stochasticity is introduced through the LoRA weights $\bm{\omega}_{A}$\. The epistemic component captures uncertainty over the adapted parameters, while the aleatoric component reflects the intrinsic predictive uncertainty of the base model conditioned on fixed weights\.
### A\.2 KL\-divergence proof
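One way to estimate the two terms from samples is sketched below\. It assumes, purely for illustration, that each stochastic forward pass returns both a predictive mean and a predictive variance \(for instance from a Gaussian regression head\); this output format is our assumption and is not specified in this section\.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split predictive variance via the law of total variance.

    means, variances: shape (S, ...); row s holds E[y | x, omega_s]
    and Var(y | x, omega_s) for the s-th sampled set of weights.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    epistemic = means.var(axis=0)        # spread of the means across weight samples
    aleatoric = variances.mean(axis=0)   # average noise level given fixed weights
    return epistemic, aleatoric, epistemic + aleatoric


# Three weight samples for a scalar prediction.
epistemic, aleatoric, total = decompose_uncertainty(
    means=[1.0, 2.0, 3.0], variances=[0.1, 0.2, 0.3]
)
```

Here the epistemic term is $2/3$ and the aleatoric term is $0.2$; drawing more weight samples refines both estimates without any additional training\.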
During training we optimise a variational lower bound to jointly maximise predictive performance and control model complexity\. This is achieved by balancing the expected log\-likelihood with a KL regulariser between the variational posterior and the prior\. In this section we derive the latter\.
###### Proposition 3 \(Closed\-form KL\-divergence\)\.
Let the variational posterior $q(\bm{\omega}_{A}\mid\bm{x})$ be as defined in \([2](https://arxiv.org/html/2605.08110#S3.E2)\) and the prior $p(\bm{\omega}_{A})$ be as defined in \([5](https://arxiv.org/html/2605.08110#S3.E5)\)\. Then the KL divergence admits the closed form

$$D_{KL}\!\left[q(\bm{\omega}_{A}\mid\bm{x})\,\|\,p(\bm{\omega}_{A})\right]=\frac{1}{2}\left(\frac{(\alpha(\bm{x})+1)(1-p)}{p}+\log\frac{p}{1-p}-\log\alpha(\bm{x})-1\right),$$

which depends only on $\alpha$ and the dropout \(prior\) rate hyperparameter $p$\.
###### Proof\.
Since both $q(\bm{\omega}_{A}\mid\bm{x})$ and $p(\bm{\omega}_{A})$ factorise over entries, the KL divergence decomposes as

$$D_{KL}\!\left[q(\bm{\omega}_{A}\mid\bm{x})\,\|\,p(\bm{\omega}_{A})\right]=\sum_{i=1}^{r}\sum_{j=1}^{d}D_{KL}\!\left[q(\bm{\omega}_{A;ij}\mid\bm{x})\,\|\,p(\bm{\omega}_{A;ij})\right].\tag{38}$$

From \([2](https://arxiv.org/html/2605.08110#S3.E2)\) and \([5](https://arxiv.org/html/2605.08110#S3.E5)\), we identify

$$q(\bm{\omega}_{A;ij}\mid\bm{x})=\mathcal{N}\!\big(\theta_{A;ij},\,\alpha(\bm{x})\theta_{A;ij}^{2}\big),\quad p(\bm{\omega}_{A;ij})=\mathcal{N}\!\big(0,\,\tfrac{p}{1-p}\theta_{A;ij}^{2}\big).$$

Using the Gaussian KL formula and substituting $\sigma_{q}^{2}=\alpha(\bm{x})\theta_{A;ij}^{2}$ and $\sigma_{p}^{2}=\frac{p}{1-p}\theta_{A;ij}^{2}$, the factor $\theta_{A;ij}^{2}$ cancels in all terms, yielding

$$D_{KL}\!\left[q(\bm{\omega}_{A;ij}\mid\bm{x})\,\|\,p(\bm{\omega}_{A;ij})\right]=\frac{1}{2}\left(\frac{\alpha(\bm{x})+1}{\frac{p}{1-p}}-1+\log\frac{p}{1-p}-\log\alpha(\bm{x})\right).\tag{39}$$

Summing over all $(i,j)$ gives the final expression in \([6](https://arxiv.org/html/2605.08110#S3.E6)\)\. ∎
In practice, we use a*normalized*version of this KL term by dividing by the number of parametersrdrdand KL maximum value, so that the regularisation strength is independent of the network width and the maximum KL value equals one\.
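The per\-entry closed form derived in the proof can be checked against the generic KL divergence between two scalar Gaussians\. The sketch below uses hypothetical function names and arbitrary test values for $\alpha$, $p$, and $\theta$; it does not implement the normalisation described above\. It also confirms numerically that the result does not depend on $\theta_{A;ij}$\.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for scalar Gaussians."""
    return 0.5 * (
        var_q / var_p + (mu_q - mu_p) ** 2 / var_p - 1.0 + math.log(var_p / var_q)
    )


def kl_per_entry(alpha, p):
    """Per-entry closed form: a function of alpha(x) and the prior rate p only."""
    odds = p / (1.0 - p)
    return 0.5 * ((alpha + 1.0) / odds - 1.0 + math.log(odds) - math.log(alpha))


alpha, p = 0.3, 0.2   # arbitrary test values
closed_form = kl_per_entry(alpha, p)

# Generic formula evaluated for several values of theta: the theta^2 factor
# cancels, so every entry yields the same KL value.
generic = [
    gaussian_kl(theta, alpha * theta**2, 0.0, p / (1.0 - p) * theta**2)
    for theta in (0.5, -2.0, 7.0)
]
```

Because the value is identical for every entry, the sum over all $(i,j)$ in the proof amounts to repeating this single number $rd$ times\.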
## Appendix B Experimental Details and Hyperparameters
This appendix describes the datasets, metrics, and architecture hyperparameters used for the experiments in the main text. All models were trained on NVIDIA H100 GPUs with 94GB of memory. All models are trained on the training set and evaluated on the test set. The validation set, if available, is used to select the best checkpoint; otherwise the last training checkpoint is used.
### B.1 Commonsense Reasoning
The commonsense reasoning task covers 8 sub\-tasks \(BoolQ, PIQA, SIQA, ARC\-c, ARC\-e, OBQA, HellaS, and WinoG\), each with its own predefined train/test split\. The goal of the task is to correctly answer a given multiple\-choice question\. The models are trained and evaluated as sequence\-based models\.
#### Dataset Generation and Availability\.
We adopt the benchmark protocol of Hu et al. [[2023](https://arxiv.org/html/2605.08110#bib.bib31)], directly using their released dataset construction and evaluation setup without modification. In particular, we use their Commonsense170K fine-tuning data, which was constructed by formatting the training sets of BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA with predefined task-specific templates. Each dataset is treated as a separate commonsense reasoning task, and all examples are converted into a unified structured prompt consisting of a task description, followed by the input context and multiple-choice options. The datasets and prompt templates are publicly available at [https://github.com/AGI-Edgerunners/LLM-Adapters](https://github.com/AGI-Edgerunners/LLM-Adapters).
#### Metrics\.
The main evaluation metric is accuracy, defined as the proportion of correctly answered queries over the total number of queries in the evaluation set. In practice, model predictions are obtained via autoregressive generation with nucleus and beam decoding. Specifically, we use temperature sampling with temperature $0.1$, top-$p$ sampling with $p=0.75$, top-$k$ filtering with $k=40$, and beam search with $4$ beams, as done in Huang et al. [[2025](https://arxiv.org/html/2605.08110#bib.bib39)]. The model generates up to 32 new tokens conditioned on the input prompt. The generated sequence is then post-processed by extracting the model's response segment and parsing the first occurrence of a task-specific keyword using a deterministic rule-based function. This extracted token is treated as the predicted label $\hat{y}_{i}$. If no valid keyword is found, the prediction is marked as incorrect. Accuracy is then computed over the full evaluation set as
$$\mathrm{Acc}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}[\hat{y}_{i}=y_{i}],\tag{40}$$
where $N$ is the number of samples, $y_{i}$ is the ground-truth label, and $\mathbb{I}[\cdot]$ is the indicator function.
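The scoring rule described above can be sketched in a few lines of Python. The keyword list and the `extract_answer` helper are hypothetical stand-ins for the task-specific rule-based parser; only the logic (take the first keyword occurrence, count a missing keyword as incorrect) follows the description in the text.

```python
import re

def extract_answer(response, keywords):
    """Return the first keyword that occurs in the response, or None."""
    pattern = "|".join(re.escape(k) for k in keywords)
    match = re.search(pattern, response)
    return match.group(0) if match else None

def accuracy(responses, labels, keywords):
    """Eq. (40): fraction of responses whose parsed keyword equals the label."""
    correct = 0
    for response, label in zip(responses, labels):
        predicted = extract_answer(response, keywords)
        # A response without any valid keyword is scored as incorrect.
        correct += int(predicted is not None and predicted == label)
    return correct / len(labels)

keywords = ["true", "false"]  # hypothetical BoolQ-style answer set
responses = ["the correct answer is true", "false, because ...", "I am not sure"]
labels = ["true", "true", "false"]
print(accuracy(responses, labels, keywords))
```

In this toy run only the first response is scored as correct: the second parses to the wrong label and the third contains no keyword.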
#### Architecture and Hyperparameters\.
We adopt the same experimental setup as Huang et al. [[2025](https://arxiv.org/html/2605.08110#bib.bib39)], training for 3 epochs using the AdamW optimizer [Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.08110#bib.bib5)] with a linear learning-rate scheduler and a warmup of 100 gradient steps, an effective batch size of 16, a learning rate of $1\times 10^{-4}$ for Llama 3 and $2\times 10^{-4}$ for Llama 2, and no weight decay. The pretrained backbone model is loaded in 8-bit quantization to reduce memory consumption while preserving inference quality, and training is performed in mixed precision (bf16). Following Huang et al. [[2025](https://arxiv.org/html/2605.08110#bib.bib39)], we apply LoRA adapters to the attention and MLP projection matrices using rank $32$ and LoRA alpha of $64$.
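For reference, a linear schedule with warmup can be written as a simple function of the step index. This is a sketch that assumes the learning rate ramps linearly from zero over the warmup steps and then decays linearly to zero at the end of training; the text does not state the decay target, and the total step count below is a made-up value rather than one reported in the paper.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Llama 3 setting from the text: base learning rate 1e-4, 100 warmup steps.
TOTAL_STEPS = 10_000  # hypothetical; depends on dataset size and batch size
for step in (0, 50, 100, 5_050, 10_000):
    print(step, linear_warmup_lr(step, 1e-4, 100, TOTAL_STEPS))
```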
Our BaLoRA inference network predicts layer\-wise uncertainty coefficients conditioned on the current input sequence\. It consists of a frozen transformer encoder \(i\.e\., the pretrained backbone model, either Llama 2 or 3\) used solely for feature extraction, followed by a lightweight MLP that maps the final hidden representation of the last token to a set of coefficients over LoRA layers\. The forward computation can be written as
$$\mathbf{x}\rightarrow\text{Frozen Transformer}\rightarrow h_{\text{last}}\rightarrow\text{MLP}(d\rightarrow 256\rightarrow 256\rightarrow L)\rightarrow\text{Softplus}\rightarrow\bm{\alpha},\tag{41}$$
where $L$ denotes the number of LoRA layers and $\bm{\alpha}\in\mathbb{R}^{L}$ are the predicted coefficients. The transformer encoder is kept entirely frozen and serves only as a feature extractor for the current input sequence, while all trainable parameters reside in the MLP.
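The head in Eq. (41) can be illustrated with a dependency-free Python sketch. Several details here are assumptions: the hidden activation (ReLU) is not specified in the text, the weights are random, the feature dimension `d` and the number of LoRA layers are toy values, and `h_last` is a random stand-in for the frozen backbone feature.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer: weight is a list of output rows, each of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def init_layer(rng, n_in, n_out):
    """Random weights and zero bias (illustrative initialisation only)."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def alpha_head(h_last, layers):
    """MLP(d -> 256 -> 256 -> L) followed by softplus, mirroring Eq. (41)."""
    h = h_last
    for weight, bias in layers[:-1]:
        h = [max(0.0, v) for v in linear(h, weight, bias)]  # ReLU is an assumption
    weight, bias = layers[-1]
    return [softplus(v) for v in linear(h, weight, bias)]

rng = random.Random(0)
d, hidden, num_lora_layers = 16, 256, 4  # d and L are toy values
layers = [init_layer(rng, d, hidden), init_layer(rng, hidden, hidden),
          init_layer(rng, hidden, num_lora_layers)]
h_last = [rng.gauss(0.0, 1.0) for _ in range(d)]  # stand-in for the frozen feature
alphas = alpha_head(h_last, layers)
print(alphas)
```

The softplus output keeps each coefficient positive, which the variance $\alpha(\bm{x})\theta^{2}$ in the variational posterior requires.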
### B.2 Visual Perception
The visual perception task covers 4 image classification benchmarks \(CIFAR\-10, CIFAR\-100, Oxford\-IIIT Pets, and Oxford Flowers\-102\), each with its own standard train/test split\. The goal of the task is to correctly classify a given input image into one of the predefined categories\. The models are fine\-tuned and evaluated using a Vision Transformer backbone with a linear classification head\.
#### Dataset Generation and Availability\.
We evaluate BaLoRA on four standard image classification benchmarks: CIFAR-10, CIFAR-100, Oxford-IIIT Pets, and Oxford Flowers-102. We follow the standard dataset splits provided by each benchmark, using the official training and test partitions without modification. All models are fine-tuned on images resized to $224\times 224$ and normalized using ImageNet mean and standard deviation. Data augmentation during training includes random horizontal flipping, random rotation, and color jittering, while evaluation is performed without augmentation.
#### Metrics\.
We report Top\-1 classification accuracy and Expected Calibration Error \(ECE\) as evaluation metrics\. Accuracy is computed as the fraction of correctly classified samples over the full test set:
$$\mathrm{Acc}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\big[\hat{y}_{i}=y_{i}\big],\tag{42}$$
where $N$ is the number of test samples, $y_{i}$ is the ground-truth label, $\hat{y}_{i}$ is the predicted label, and $\mathbb{I}[\cdot]$ is the indicator function.
ECE is computed using 15 bins with L1 normalization. Let $B_{m}$ denote the set of indices in bin $m$, $\mathrm{acc}(B_{m})$ the empirical accuracy in that bin, and $\mathrm{conf}(B_{m})$ the average predicted confidence. Then:
$$\mathrm{ECE}=\sum_{m=1}^{15}\frac{|B_{m}|}{N}\left|\mathrm{acc}(B_{m})-\mathrm{conf}(B_{m})\right|.\tag{43}$$
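Eq. (43) can be sketched in plain Python as follows. The binning convention (equal-width bins on $[0,1]$, with a confidence of exactly 1.0 placed in the last bin) is an assumption, since the text only fixes the number of bins and the L1 norm; the toy inputs are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Eq. (43) with equal-width confidence bins on [0, 1]."""
    n = len(confidences)
    bin_conf = [0.0] * n_bins
    bin_hits = [0.0] * n_bins
    bin_count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        m = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bin_conf[m] += conf
        bin_hits[m] += float(hit)
        bin_count[m] += 1
    ece = 0.0
    for m in range(n_bins):
        if bin_count[m] == 0:
            continue  # empty bins contribute nothing
        acc = bin_hits[m] / bin_count[m]
        avg_conf = bin_conf[m] / bin_count[m]
        ece += (bin_count[m] / n) * abs(acc - avg_conf)
    return ece

confidences = [0.95, 0.95, 0.55, 0.55]  # toy predicted confidences
correct = [1, 1, 1, 0]                  # 1 if the prediction was right
ece = expected_calibration_error(confidences, correct)
print(ece)
```

In this toy case the two occupied bins each hold half the samples and each has an accuracy/confidence gap of 0.05, so the ECE is approximately 0.05.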
#### Architecture and Hyperparameters\.
We optimize using AdamW with learning rate $10^{-3}$, no weight decay, batch size $512$, and linear scheduling with 10% warmup. Gradient clipping is applied with a maximum norm of 1.0. All models are trained in mixed precision (bf16) using 8-bit loading for the backbone encoder. We follow the PEFT protocol of Dosovitskiy et al. [[2020](https://arxiv.org/html/2605.08110#bib.bib36)] and apply LoRA adapters to the query and value projection matrices of all transformer attention layers, as implemented in the `timm` library. We use rank $r\in\{8,16\}$ and scaling factor $\alpha\in\{16,32\}$ depending on dataset complexity, with no additional modifications to the pretrained backbone.
Our backbone model is a Vision Transformer (ViT-L/16) pretrained on ImageNet-21K [Dosovitskiy et al., [2020](https://arxiv.org/html/2605.08110#bib.bib36)]. The model processes images as sequences of $16\times 16$ patches and produces a sequence of token embeddings of dimension $d$. A classification token (CLS) is prepended to the patch sequence and used for downstream prediction.
The BaLoRA inference network uses a second Vision Transformer encoder as a frozen feature extractor. This encoder shares the same architectural family as the backbone (ViT), but is instantiated independently and loaded in 8-bit precision for memory efficiency. Unlike the backbone, this encoder is never updated and serves only to extract a global image representation from the CLS token. We ablate several encoder choices for this module (see Figure [4](https://arxiv.org/html/2605.08110#A3.F4)) and select DeiT-Base [Touvron et al., [2021](https://arxiv.org/html/2605.08110#bib.bib4)], distilled on ImageNet-1k, as it performs best on validation. In particular, given an input image $\mathbf{x}$, the inference network computes:
$$\mathbf{x}\rightarrow\text{Frozen ViT Encoder}\rightarrow h_{\text{CLS}}\in\mathbb{R}^{d}\rightarrow\text{MLP}(d\rightarrow 256\rightarrow 256\rightarrow L)\rightarrow\text{Softplus}\rightarrow\bm{\alpha},$$
where $L$ is the number of LoRA layers. The ViT encoder is kept entirely frozen and serves only as a feature extractor for the current input image, while all trainable parameters reside in the MLP.
### B.3 Materials Property Prediction
The materials property prediction task targets bandgap regression on the QMOF database\. The goal of the task is to predict the DFT\-computed electronic bandgap \(in eV\) of a given metal–organic framework structure\. The models are fine\-tuned and evaluated using a multimodal MOFTransformer backbone that jointly processes crystal graph and three\-dimensional energy grid representations\.
#### Dataset Generation and Availability\.
We evaluate BaLoRA on the QMOF bandgap prediction task using the QMOF database [Rosen et al., [2021](https://arxiv.org/html/2605.08110#bib.bib34)], a standard regression benchmark for metal–organic framework (MOF) property prediction. The dataset consists of DFT-computed electronic bandgap values (in eV) for crystalline MOF structures. We follow the standard train/validation/test split provided by the benchmark without modification. The dataset is publicly available through the `moftransformer` package [Kang et al., [2023](https://arxiv.org/html/2605.08110#bib.bib17)].
#### Metrics\.
The primary evaluation metric is Mean Absolute Error \(MAE\), computed between model predictions and ground\-truth bandgap values on the test set:
$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\lvert\hat{y}_{i}-y_{i}\rvert,\tag{44}$$
where $N$ is the number of test samples, $y_{i}$ is the ground-truth bandgap, and $\hat{y}_{i}$ is the predicted value. Predictions are denormalized using the training-set mean $\mu=2.0899$ and standard deviation $\sigma=1.1295$ before computing the metric. For uncertainty estimation, we additionally assess uncertainty calibration quality. Specifically, we perform $T=100$ stochastic forward passes at test time, compute the per-sample predictive variance, and report the Spearman rank correlation between the predictive variance and the squared prediction error. A higher correlation indicates that the model's uncertainty estimates are better aligned with its actual errors, reflecting more reliable uncertainty quantification.
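The evaluation procedure above can be sketched with the Python standard library. This sketch assumes that the point prediction is the mean over the stochastic passes and that the variance is computed on denormalized predictions; neither detail is stated explicitly in the text. The helper names and the toy inputs (three passes, three samples) are illustrative.

```python
import statistics

MU, SIGMA = 2.0899, 1.1295  # training-set statistics quoted in the text

def denormalize(z):
    """Map a normalized prediction back to eV."""
    return z * SIGMA + MU

def ranks(values):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = statistics.fmean(ra), statistics.fmean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

def evaluate(mc_predictions, targets):
    """mc_predictions[t][i]: normalized output of pass t for sample i; targets in eV."""
    n = len(targets)
    per_sample = [[denormalize(p[i]) for p in mc_predictions] for i in range(n)]
    means = [statistics.fmean(s) for s in per_sample]
    variances = [statistics.pvariance(s) for s in per_sample]
    mae = statistics.fmean([abs(m - y) for m, y in zip(means, targets)])
    sq_err = [(m - y) ** 2 for m, y in zip(means, targets)]
    return mae, spearman(variances, sq_err)

# Toy data: 3 stochastic passes over 3 samples (the paper uses T = 100).
mc_predictions = [[0.0, 1.0, -1.0], [0.0, 1.2, -0.5], [0.0, 0.8, -1.5]]
targets = [2.0899, 3.3194, 1.4604]
mae, rho = evaluate(mc_predictions, targets)
print(mae, rho)
```

In the toy data the sample with the largest spread across passes also has the largest error, so the rank correlation is at its maximum.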
#### Architecture and Hyperparameters\.
The backbone model is MOFTransformer [Kang et al., [2023](https://arxiv.org/html/2605.08110#bib.bib17)], a multimodal transformer that jointly processes crystal graph and three-dimensional energy grid representations of MOF structures. Following preliminary experiments, we apply LoRA adapters to the MLP projection matrices of all transformer blocks, using rank $r=64$ and scaling factor $\alpha=128$. Training is performed for $20$ epochs using the AdamW optimizer [Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.08110#bib.bib5)] with learning rate $5\times 10^{-4}$, no weight decay, batch size $32$, and a linear learning-rate schedule with $5\%$ warmup. Gradient clipping is applied with a maximum norm of $1.0$, and training uses mixed precision (bf16). For LoRA and DoRA baselines, we apply a dropout rate of $0.1$ on the adapter weights; for BaLoRA, dropout is disabled. The regression head and all adapter weights are trained end-to-end using an L1 loss on the normalized targets.
The BaLoRA inference network predicts layer-wise uncertainty coefficients conditioned on the input MOF structure. It reuses the frozen graph and volume embedding modules from the pretrained backbone as feature extractors. Given an input sample, graph token embeddings are obtained from the frozen CGCNN [Xie and Grossman, [2018](https://arxiv.org/html/2605.08110#bib.bib3)] encoder and aggregated via masked mean pooling over valid atom positions. The resulting graph-level representation is summed with the frozen volume embedding and passed through a lightweight MLP head:
$$\mathbf{x}\rightarrow\text{Frozen Graph Encoder}\rightarrow\bar{h}_{\text{graph}}\oplus h_{\text{vol}}\rightarrow\text{MLP}(d\rightarrow d/2\rightarrow L)\rightarrow\text{Softplus}\rightarrow\bm{\alpha},\tag{45}$$
where $L$ denotes the number of LoRA layers, $\oplus$ denotes element-wise addition, and $\bm{\alpha}\in\mathbb{R}^{L}$ are the predicted coefficients. The BaLoRA KL divergence term uses a prior dropout probability of $0.1$. The graph and volume embedding parameters are kept entirely frozen; all trainable parameters reside in the MLP head.
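The feature construction that feeds the MLP head in Eq. (45) can be sketched as follows. The function names, the two-dimensional embeddings, and the padding row are illustrative assumptions; only the masked mean pooling and the element-wise sum follow the text.

```python
def masked_mean_pool(token_embeddings, mask):
    """Average token embeddings over positions where mask is 1 (valid atoms)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for embedding, valid in zip(token_embeddings, mask):
        if valid:
            count += 1
            for k in range(dim):
                total[k] += embedding[k]
    return [t / max(count, 1) for t in total]

def fuse(graph_tokens, mask, volume_embedding):
    """Element-wise sum of the pooled graph feature and the volume embedding."""
    pooled = masked_mean_pool(graph_tokens, mask)
    return [g + v for g, v in zip(pooled, volume_embedding)]

graph_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]                                     # padding is excluded from the mean
volume_embedding = [0.5, -0.5]
features = fuse(graph_tokens, mask, volume_embedding)
print(features)
```

Masking matters because MOF structures contain different numbers of atoms; without it, padding positions would bias the pooled representation.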
## Appendix C Additional Results
In this section we present additional ablation studies and convergence analyses that complement the main experimental results\. We first examine the sensitivity of BaLoRA to the prior probability hyperparameter on commonsense reasoning, then ablate the choice of frozen encoder in the inference network for visual perception, and finally analyze the convergence of uncertainty estimates as a function of Monte Carlo steps on materials property prediction\.
### C.1 Additional Results on Commonsense Reasoning
Table [6](https://arxiv.org/html/2605.08110#A3.T6) reports the effect of the prior probability $p$ in the BaLoRA KL divergence loss (Equation ([6](https://arxiv.org/html/2605.08110#S3.E6))) across all eight commonsense reasoning benchmarks. For both Llama-2-7B and Llama-3-8B, accuracy remains stable across a wide range of prior values ($p\in\{0.2,0.4,0.6,0.8\}$), with no single setting consistently dominating across all tasks. On Llama-2-7B, the best average performance is obtained at $p=0.6$, while on Llama-3-8B, $p=0.8$ yields the highest scores on most benchmarks. Importantly, the spread between the best and worst prior settings is modest in all cases, indicating that BaLoRA is robust to this hyperparameter and does not require expensive tuning of $p$ in practice.
Table 6: Ablation study on the effect of prior probability $p$ in the KL divergence loss formulation (Equation ([6](https://arxiv.org/html/2605.08110#S3.E6))) across benchmark tasks. Results demonstrate robustness across different values of $p$.
### C.2 Additional Results on Visual Perception
Figure [4](https://arxiv.org/html/2605.08110#A3.F4) ablates the choice of frozen encoder used in the BaLoRA inference network for the visual perception experiments. We compare four Vision Transformer variants of increasing capacity: ViT-Tiny, ViT-Small, ViT-Medium (DeiT-Base), and ViT-Large (ViT-L/16). Across all four classification benchmarks, both accuracy (top) and ECE (bottom) are remarkably consistent regardless of encoder size. These results suggest that the BaLoRA inference network does not require a large or expensive encoder to produce well-calibrated uncertainty coefficients; even the smallest ViT-Tiny variant yields competitive performance, underscoring the lightweight nature of the approach.
Figure 4: Ablation study of posterior network encoders on classification performance. We compare Vision Transformer variants: ViT-Large (ViT-L/16 on ImageNet-21k), ViT-Medium (DeiT-Base, distilled on ImageNet-1k), ViT-Small (DeiT-Small, distilled on ImageNet-1k), and ViT-Tiny (DeiT-Tiny, distilled on ImageNet-1k).
### C.3 Additional Results on Materials Property Prediction
Figure [5](https://arxiv.org/html/2605.08110#A3.F5) shows the convergence behavior of BaLoRA on the QMOF bandgap prediction task as a function of the number of Monte Carlo (MC) forward passes at test time. The left panel reports MAE (in eV), which remains stable at approximately $0.267$ eV across all MC budgets from $10$ to $100$ steps, with narrow confidence bands indicating low variance across seeds. This confirms that the predictive mean is already well-estimated with a modest number of stochastic passes, and additional MC steps do not degrade accuracy. The right panel reports the Spearman rank correlation between predictive variance and squared error, a measure of uncertainty calibration quality. Here, a clear upward trend is observed: the correlation improves from approximately $0.315$ at $10$ MC steps to roughly $0.347$ at $100$ steps, with the most rapid gains occurring between $10$ and $50$ steps before plateauing. This indicates that while a small MC budget suffices for accurate point predictions, a moderately larger number of forward passes ($\geq 50$) is beneficial for obtaining well-ranked uncertainty estimates. The shaded confidence intervals widen slightly at higher MC budgets, reflecting increased seed-level variability in the correlation statistic, but the overall trend is monotonically increasing. Together, these results demonstrate that BaLoRA provides stable predictions and progressively improving uncertainty quantification at test time without any retraining.
Figure 5: Performance convergence in eV for BaLoRA on bandgap prediction using MOFTransformer on the QMOF dataset. The curve is obtained with multiple test-time forward passes (MC steps). No retraining is performed; uncertainty estimates are obtained purely through stochastic forward passes. Shaded regions denote ±1 standard deviation across three random seeds.