Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Hugging Face Daily Papers 06/10/26, 09:30 AM Papers

fine-tuning multimodal-llm parameter-efficient gradient-backpropagation visual-input-optimization qwen art

Summary

ART (Art-based Reinforcement Training) enables parameter-efficient fine-tuning of frozen multimodal LLMs by optimizing raw visual input via gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs for high-throughput engines like vLLM.

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
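The core mechanism described above, keeping every model weight frozen and pushing gradients all the way back into a pixel array, can be illustrated with a toy sketch. This is not the paper's implementation: a tiny random two-layer network stands in for the frozen MLLM, text prompts are plain feature vectors, and all names (`make_frozen_model`, `optimize_pixels`, and so on) are hypothetical. The step-halving safeguard and the clipping of pixels to [0, 1] are assumptions added here to keep the toy well behaved.

```python
import copy
import math
import random


def make_frozen_model(rng, n_pixels, n_text, n_hidden, n_classes):
    """Random fixed weights standing in for a frozen MLLM (toy assumption)."""
    def mat(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return {
        "w_img": mat(n_hidden, n_pixels),   # image pathway
        "w_txt": mat(n_hidden, n_text),     # text pathway
        "w_out": mat(n_classes, n_hidden),  # output head
    }


def loss_and_pixel_grad(model, pixels, text, label):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    # Forward pass: hidden = tanh(W_img @ pixels + W_txt @ text), logits = W_out @ hidden.
    hidden = [
        math.tanh(sum(w * p for w, p in zip(wi, pixels)) +
                  sum(w * t for w, t in zip(wt, text)))
        for wi, wt in zip(model["w_img"], model["w_txt"])
    ]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in model["w_out"]]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])
    # Backward pass: no weight gradients are computed, only d(loss)/d(pixels).
    d_logits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    d_pre = [
        (1.0 - h * h) * sum(model["w_out"][k][j] * d_logits[k]
                            for k in range(len(d_logits)))
        for j, h in enumerate(hidden)
    ]
    grad = [
        sum(model["w_img"][j][i] * d_pre[j] for j in range(len(d_pre)))
        for i in range(len(pixels))
    ]
    return loss, grad


def batch_loss_and_grad(model, pixels, data):
    """Mean loss and mean pixel gradient over the whole dataset."""
    total_loss, total_grad = 0.0, [0.0] * len(pixels)
    for text, label in data:
        loss, grad = loss_and_pixel_grad(model, pixels, text, label)
        total_loss += loss
        total_grad = [a + g for a, g in zip(total_grad, grad)]
    n = len(data)
    return total_loss / n, [g / n for g in total_grad]


def optimize_pixels(model, data, pixels, steps=200, lr=0.5):
    """Projected gradient descent on the pixels; halve lr when a step fails."""
    loss, grad = batch_loss_and_grad(model, pixels, data)
    for _ in range(steps):
        # Clip to [0, 1] so the result remains a valid image.
        candidate = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
        cand_loss, cand_grad = batch_loss_and_grad(model, candidate, data)
        if cand_loss < loss:
            pixels, loss, grad = candidate, cand_loss, cand_grad
        else:
            lr *= 0.5
    return pixels, loss


rng = random.Random(0)
model = make_frozen_model(rng, n_pixels=16, n_text=4, n_hidden=8, n_classes=3)
frozen_copy = copy.deepcopy(model)  # used to check the weights never change

# Toy "task": the label is the index of the largest of the first three text features.
data = []
for _ in range(32):
    text = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    data.append((text, max(range(3), key=lambda k: text[k])))

init_pixels = [0.5] * 16  # start from a flat gray "image"
init_loss, _ = batch_loss_and_grad(model, init_pixels, data)
art_pixels, final_loss = optimize_pixels(model, data, init_pixels)
print(f"loss before: {init_loss:.4f}  after: {final_loss:.4f}")
```

Because the only trained object is `art_pixels`, deploying the result in this toy setting amounts to passing a different image to an unchanged model, which mirrors the abstract's argument for compatibility with pre-compiled graphs. How much a real MLLM can be steered this way is an empirical question that the sketch does not address.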

Original Article

Cached at: 06/11/26, 01:37 PM

Source: https://huggingface.co/papers/2606.11854

Abstract

ART enables parameter-efficient fine-tuning of frozen multimodal language models by optimizing raw visual input through gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs.

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach’s effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
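The three approaches the abstract contrasts can be written side by side to show where the trainable parameters live in each. The notation is an assumption introduced here for illustration, since the abstract gives no formulas: W_0 is a frozen weight matrix, B and A are LoRA's low-rank factors, P holds trainable soft-prompt embeddings, E(t) is the embedding of text prompt t, f_θ is the frozen model, x is the pixel array, L is any differentiable task loss, and η is a step size. The [0, 1] pixel range and the clipping step are likewise assumptions.

```latex
\begin{align*}
\text{LoRA:} \quad
  & h = W_0 z + B A z,
  && W_0 \in \mathbb{R}^{d \times k},\ B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k) \\
\text{Soft prompting:} \quad
  & \hat{y} = f_\theta\big([P;\, E(t)]\big),
  && P \in \mathbb{R}^{m \times d_{\text{emb}}} \text{ trainable} \\
\text{ART (sketch):} \quad
  & x^\star = \arg\min_{x \in [0,1]^{H \times W \times 3}} \mathbb{E}_{(t, y)}\, \mathcal{L}\big(f_\theta(x, t),\, y\big),
  && \theta \text{ frozen} \\
  & x \leftarrow \operatorname{clip}_{[0,1]}\big(x - \eta\, \nabla_x \mathcal{L}\big)
\end{align*}
```

In all three cases the pretrained weights stay fixed. Only in the last case is the trained quantity an ordinary model input, which is the basis for the abstract's claim that a pre-compiled graph needs no modification.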

Get this paper in your agent:

hf papers read 2606.11854

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

@akshay_pachaar: LLM fine-tuning techniques I'd learn if I were to customize them: Bookmark this. 1. LoRA 2. QLoRA 3. Prefix Tuning 4. A…

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

Reinforcing Multimodal Reasoning Against Visual Degradation
