Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
Summary
ART (Art-based Reinforcement Training) enables parameter-efficient fine-tuning of frozen multimodal LLMs by optimizing raw visual input via gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs for high-throughput engines like vLLM.
Cached at: 06/11/26, 01:37 PM
Source: https://huggingface.co/papers/2606.11854
Abstract
ART enables parameter-efficient fine-tuning of frozen multimodal language models by optimizing raw visual input through gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs.
There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach’s effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
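The core idea in the abstract, keeping the model frozen and pushing gradients back into the pixel array, can be illustrated with a toy sketch. This is not the paper's implementation: the frozen "model" here is a hypothetical 2x4 linear classifier (`FROZEN_W`) standing in for an MLLM, the loss is plain cross-entropy, and the names `optimize_pixels`, `pixel_gradient`, the step count, and the learning rate are all assumptions made for illustration.

```python
import math

# Hypothetical frozen "model": a fixed linear layer mapping 4 pixels to
# 2 class logits. A tuple of tuples, so the weights cannot be modified.
FROZEN_W = ((0.5, -1.0, 0.3, 0.8),
            (-0.7, 0.6, 0.9, -0.2))


def logits(pixels):
    """Forward pass through the frozen layer."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in FROZEN_W]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(pixels, target):
    """Cross-entropy of the frozen model's prediction against `target`."""
    return -math.log(softmax(logits(pixels))[target])


def pixel_gradient(pixels, target):
    """Gradient of the loss with respect to the pixels only:
    W^T (softmax(W x) - onehot(target))."""
    probs = softmax(logits(pixels))
    err = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    return [
        sum(err[k] * FROZEN_W[k][j] for k in range(len(FROZEN_W)))
        for j in range(len(pixels))
    ]


def optimize_pixels(pixels, target, steps=200, lr=0.5):
    """Gradient descent on the input; the model weights are never touched."""
    pixels = list(pixels)
    for _ in range(steps):
        grad = pixel_gradient(pixels, target)
        # Clip to [0, 1] so the result remains a valid pixel array.
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels


initial = [0.5, 0.5, 0.5, 0.5]
trained = optimize_pixels(initial, target=1)
loss_before = loss(initial, 1)
loss_after = loss(trained, 1)
```

Because only the input changes, the forward computation is identical before and after training, which is the property the abstract relies on for pre-compiled computational graphs. In a real MLLM the gradient would come from automatic differentiation through the vision encoder and language model rather than from a closed-form expression.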
Similar Articles
ART: Attention Run-time Termination for Efficient Large Language Model Decoding
This paper proposes ART, a lightweight run-time mechanism that tracks accumulated attention outputs during LLM decoding and terminates unnecessary KV block accesses, achieving 20% higher generation throughput with comparable accuracy.
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.
Reinforcing Multimodal Reasoning Against Visual Degradation
This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.
CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs
CRMA introduces a spectrally-bounded residual adapter that enables continual fine-tuning of LLMs without catastrophic forgetting by enforcing a doubly-stochastic mixing matrix via Sinkhorn normalization. Experimental results on Mistral-7B and Gemma-2-9B show improved backward transfer and reduced forgetting compared to frozen-substrate baselines.
Goal-Conditioned Supervised Learning for LLM Fine-Tuning
This paper proposes goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs, which treats feedback as an explicit goal and trains models via supervised learning with a novel goal formulation and natural-language goal representations. Evaluated on non-toxic generation, code generation, and recommendation, it outperforms standard offline baselines.