MiCA is now part of Hugging Face PEFT

Reddit r/LocalLLaMA Tools

Summary

MiCA (Minor Component Adaptation), a new fine-tuning method that initializes adapters in the minor singular subspace for better knowledge uptake and less forgetting, has been merged into the Hugging Face PEFT library. It is available via the PEFT main branch and integrates through the existing LoRA interface with init_lora_weights='mica'.

Glad to share that MiCA, short for Minor Component Adaptation, has now been merged into the Hugging Face PEFT library.

It is not yet included in the latest PyPI release, but you can already install it directly from PEFT main:

    pip install --upgrade git+https://github.com/huggingface/peft.git@main

Then using MiCA is minimal:

    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        init_lora_weights="mica",
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

    # base_model is an already loaded causal language model
    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()

That’s it. MiCA is exposed through the existing LoRA interface via:

    init_lora_weights="mica"

The idea behind MiCA is simple: instead of adapting along the dominant singular directions of a pretrained weight matrix, MiCA uses the minor singular subspace.

For a weight matrix:

    W = U Σ Vᵀ

MiCA initializes:

    B = U[:, -r:]
    A = 0

So the adapter starts as a no-op, because

    B A = 0

The base model output is preserved exactly at initialization. During training, MiCA keeps B frozen and only trains A.

Why is this useful? The intuition is that the major singular directions already encode much of the pretrained model’s existing behavior. The minor directions are less used by the original model and may provide a more plastic subspace for injecting new knowledge.

In our experiments, averaged over two experiments and three models, MiCA showed the following compared with LoRA in the tested setup:

- about 90% higher knowledge uptake
- about 20% less catastrophic forgetting
- about 80% fewer trainable parameters

See the paper for the full experimental details.

A practical rule of thumb: if you have a LoRA setup that works well, try MiCA with:

    r_mica ≈ r_lora / 2
    learning_rate_mica ≈ 2 × learning_rate_lora

Because MiCA trains only one of the two LoRA matrices, you often need fewer parameters and can use a somewhat higher learning rate.

Best practice: MiCA is mainly intended for continued pretraining / domain-adaptive pretraining.
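To make the initialization described above concrete, here is a minimal NumPy sketch. It is not the PEFT implementation; the function name `mica_init`, the toy 32×32 matrix, and the parameter-count helper lines are assumptions made only for illustration. It builds B from the last r left singular vectors, sets A to zero, and checks that the adapted layer matches the base layer at initialization.

```python
import numpy as np

def mica_init(W: np.ndarray, r: int):
    """Hypothetical helper: build (B, A) for W of shape (out_features, in_features)."""
    # np.linalg.svd returns singular values in descending order,
    # so the last r columns of U belong to the smallest singular values.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, -r:].copy()             # minor left singular directions, kept frozen
    A = np.zeros((r, W.shape[1]))    # the only trainable matrix, starts at zero
    return B, A

rng = np.random.default_rng(0)
d_out, d_in, r_mica = 32, 32, 8      # toy sizes, chosen arbitrarily
W = rng.standard_normal((d_out, d_in))
B, A = mica_init(W, r_mica)

# At initialization B @ A is the zero matrix, so the layer output is unchanged.
x = rng.standard_normal(d_in)
base_out = W @ x
adapted_out = (W + B @ A) @ x

# Trainable parameters for this one layer: MiCA trains only A,
# while LoRA trains both A and B. Here LoRA uses twice the rank,
# following the rule of thumb r_mica ≈ r_lora / 2.
r_lora = 2 * r_mica
mica_trainable = r_mica * d_in
lora_trainable = r_lora * (d_in + d_out)
print(mica_trainable, lora_trainable)
```

For a square layer this toy count gives MiCA a quarter of the LoRA parameters. The roughly 80% figure quoted above comes from the authors' own setup and is not reproduced by this sketch.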
A recommended workflow is:

1. Start from the base model, not the instruct/chat model.
2. Train the MiCA adapter on domain text.
3. Merge the adapter into the model.
4. Use the merged model as the adapted base for later instruction/chat tuning.

In many cases, merging or transferring the adapter into the corresponding instruct/chat model can work better; see the MiCA paper for details.

We tested MiCA primarily for continued pretraining and supervised fine-tuning. Early RL results look promising. Instruction fine-tuning alone was not the most useful setting in our experiments.

Huge thanks to Sebastian Raschka for the collaboration, and to the Hugging Face team (Lewis Tunstall and Benjamin Bossan) for review and integration.

Preprint: https://arxiv.org/abs/2604.01694

https://preview.redd.it/rbqi05lrb6ah1.png?width=1672&format=png&auto=webp&s=0f62e0f43b3926eb6ef0079fcd1fe4af38f1b831
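The merge step in the recommended workflow can be illustrated with a small NumPy sketch. It assumes MiCA adapters use the standard LoRA scaling of `lora_alpha / r`, which seems plausible since MiCA is exposed through `LoraConfig`, but that is an assumption here. The helper `merge_adapter` and the random stand-in for a trained A are hypothetical.

```python
import numpy as np

def merge_adapter(W, B, A, lora_alpha, r):
    """Hypothetical helper: fold a low-rank update into the base weight."""
    # Standard LoRA scaling (assumed to apply to MiCA as well).
    return W + (lora_alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d, r, lora_alpha = 16, 4, 8           # toy sizes, chosen arbitrarily
W = rng.standard_normal((d, d))

# Frozen B from the minor left singular directions of W.
U, _, _ = np.linalg.svd(W)
B = U[:, -r:]

# Stand-in for an A matrix after training (random values, not real training).
A = 0.01 * rng.standard_normal((r, d))

W_merged = merge_adapter(W, B, A, lora_alpha, r)

# The merged weight gives the same output as base layer plus adapter branch.
x = rng.standard_normal(d)
adapter_out = W @ x + (lora_alpha / r) * (B @ (A @ x))
merged_out = W_merged @ x
print(np.allclose(adapter_out, merged_out))
```

With PEFT itself, LoRA adapters are normally merged by calling `merge_and_unload()` on the PEFT model. Whether any MiCA-specific handling is involved is not covered in the post.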