Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Hugging Face Daily Papers Papers

Summary

ART (Art-based Reinforcement Training) enables parameter-efficient fine-tuning of frozen multimodal LLMs by optimizing raw visual input via gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs for high-throughput engines like vLLM.

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
Original Article
View Cached Full Text

Cached at: 06/11/26, 01:37 PM

Paper page - Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Source: https://huggingface.co/papers/2606.11854

Abstract

ART enables parameter-efficient fine-tuning of frozen multimodal language models by optimizing raw visual input through gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs.

There are two mainParameter-Efficient Fine-Tuning(PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers,Soft Promptingintroduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to thecomputational graphsof precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines likevLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozenMultimodal Large Language Model(MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiledcomputational graphs. It relies onbackpropagationof gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach’s effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive withLoRAacross mathematics and structured-tool-use benchmarks.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2606\.11854

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.11854 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.11854 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.11854 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

arXiv cs.CL

This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.

Reinforcing Multimodal Reasoning Against Visual Degradation

Hugging Face Daily Papers

This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.

CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs

arXiv cs.LG

CRMA introduces a spectrally-bounded residual adapter that enables continual fine-tuning of LLMs without catastrophic forgetting by enforcing a doubly-stochastic mixing matrix via Sinkhorn normalization. Experimental results on Mistral-7B and Gemma-2-9B show improved backward transfer and reduced forgetting compared to frozen-substrate baselines.

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

arXiv cs.LG

This paper proposes goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs, which treats feedback as an explicit goal and trains models via supervised learning with a novel goal formulation and natural-language goal representations. Evaluated on non-toxic generation, code generation, and recommendation, it outperforms standard offline baselines.