Dropping learning rate fixed my QLoRA fine-tune more than anything else I tried

Reddit r/LocalLLaMA 05/14/26, 12:40 PM News

fine-tuning qlora learning-rate llama-3-1 classification overfitting hyperparameter-tuning

Summary

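A user reports that lowering the learning rate from 2e-4 to 1e-4, while raising epochs from 3 to 5, markedly improved QLoRA fine-tuning of Llama 3.1 8B on a small classification dataset (about 8k samples): the model stopped overfitting in the first epoch and evaluation results improved. Removing roughly a third of the data (mislabeled and ambiguous samples) also improved eval.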

Been fine-tuning Llama 3.1 8B with QLoRA for a classification task using about 8k samples. I was getting bad eval results for a while and kept thinking something was wrong with my data. Tried cleaning the dataset, tried different prompt templates, messed with rank and alpha. Nothing really changed.

Dropped the learning rate from 2e-4 to 1e-4 and bumped epochs from 3 to 5. Ran it on a 5090 I rent on Hyperai since our lab machines are always booked. Completely different results. Same data, same everything else. 2e-4 is just too aggressive when your dataset is that small. The model overfits in the first epoch and then just goes in circles for the rest of training. Lower LR gave it more room to converge without blowing past everything.

Also ended up cutting about a third of my dataset, mostly mislabeled and ambiguous stuff. Eval got better with less data, which yeah yeah, everyone says that, but it's different when you see the numbers yourself lol. 2e-4 is the default everywhere and I don't think it works well below a certain dataset size.
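
For anyone wanting to reproduce the comparison, here is a minimal sketch of the two runs as plain config objects. Only the learning rates (2e-4 vs 1e-4), the epoch counts (3 vs 5), and the rough dataset size (about 8k) come from the post. The batch size and gradient accumulation values are assumptions picked for illustration, and `RunConfig` is a hypothetical helper, not part of any library. The keys returned by `trainer_kwargs()` use the argument names of Hugging Face `TrainingArguments`, but nothing here imports or calls that library, and the step count is only an approximation of what a real trainer would report.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the settings compared in the post."""
    learning_rate: float
    num_train_epochs: int
    num_samples: int                       # "about 8k" in the post
    per_device_train_batch_size: int = 4   # assumption, not from the post
    gradient_accumulation_steps: int = 4   # assumption, not from the post

    @property
    def effective_batch_size(self) -> int:
        # Single-GPU run, so no multiplication by device count.
        return self.per_device_train_batch_size * self.gradient_accumulation_steps

    @property
    def optimizer_steps(self) -> int:
        # Approximate number of weight updates over the whole run.
        steps_per_epoch = math.ceil(self.num_samples / self.effective_batch_size)
        return steps_per_epoch * self.num_train_epochs

    def trainer_kwargs(self) -> dict:
        # Keys follow Hugging Face TrainingArguments naming.
        return {
            "learning_rate": self.learning_rate,
            "num_train_epochs": self.num_train_epochs,
            "per_device_train_batch_size": self.per_device_train_batch_size,
            "gradient_accumulation_steps": self.gradient_accumulation_steps,
        }


# The two runs from the post: same data, different LR and epoch count.
before = RunConfig(learning_rate=2e-4, num_train_epochs=3, num_samples=8000)
after = RunConfig(learning_rate=1e-4, num_train_epochs=5, num_samples=8000)

for name, cfg in (("before", before), ("after", after)):
    print(name, cfg.trainer_kwargs(), "approx. optimizer steps:", cfg.optimizer_steps)
```

Under these assumed batch settings the second run takes more optimizer steps at half the step size, so a fair comparison should look at the eval curve per epoch for both runs, not only the final number.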

Similar Articles

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

arXiv cs.LG

Hybrid-LoRA proposes a framework that selectively applies full fine-tuning to a small subset of modules while using LoRA for the rest, achieving performance near full fine-tuning with significantly lower computational cost. Experiments show improvements of up to 5.65% over existing parameter-efficient baselines.

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

arXiv cs.LG

Researchers from AMD propose Recover-LoRA, a method that uses low-rank adaptation with knowledge distillation on synthetic data to recover accuracy lost from aggressive 2-bit quantization of LLMs, achieving 80–95% accuracy recovery on 9 of 12 benchmarks for Qwen3-4B using only 10k synthetic samples.

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

arXiv cs.AI

This paper presents a deployment-focused study comparing LoRA fine-tuning of 24 model variants (270M–8B parameters) for merchant information extraction from financial transaction strings. The authors find that smaller models like Qwen 3.5 4B achieve 96.6% F1, within 0.35 points of the 8B baseline, while offering significant reductions in latency and cost.

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

arXiv cs.CL

This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.

QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning