LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models
Summary
LayerRoute is a lightweight adapter that selectively skips transformer blocks during inference based on input type, achieving compute savings while maintaining or improving model quality through gated routing and LoRA adaptation. On Qwen2.5-0.5B-Instruct it reaches a skip differential of 12.91 percentage points: tool calls skip 15.25% of FLOPs, while planning steps skip only 2.34%.
Source: https://huggingface.co/papers/2606.01838
Abstract
Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low-perplexity) and open-ended planning/reasoning steps (long, complex, high-perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.
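The parameter budget quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not taken from the paper: it assumes the publicly documented Qwen2.5-0.5B attention shapes (hidden size 896, grouped-query K/V projections of width 128), a router with a bias term, and standard LoRA factor shapes. The helper names `linear_params` and `lora_params` are hypothetical and exist only for this illustration.

```python
# Back-of-the-envelope check of the parameter counts quoted in the abstract.
# Assumptions (not stated in the abstract): K/V projection width of 128 from
# the public Qwen2.5-0.5B config, a bias on each router, and LoRA factors
# A (rank x d_in) and B (d_out x rank) with no extra parameters.

HIDDEN = 896                    # matches Linear(896, 1) in the abstract
KV_DIM = 128                    # assumption: 2 KV heads x 64 head dim
NUM_BLOCKS = 24                 # transformer blocks, from the abstract
LORA_RANK = 8                   # from the abstract
BACKBONE_PARAMS = 494_000_000   # "494M backbone", from the abstract


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


router_per_block = linear_params(HIDDEN, 1)          # 896 weights + 1 bias
lora_per_block = (
    lora_params(HIDDEN, HIDDEN, LORA_RANK)           # Q projection
    + lora_params(HIDDEN, KV_DIM, LORA_RANK)         # K projection
    + lora_params(HIDDEN, KV_DIM, LORA_RANK)         # V projection
    + lora_params(HIDDEN, HIDDEN, LORA_RANK)         # O projection
)

router_total = NUM_BLOCKS * router_per_block
lora_total = NUM_BLOCKS * lora_per_block
trainable_total = router_total + lora_total
trainable_fraction = trainable_total / BACKBONE_PARAMS

# Skip differential: tool-call skip rate minus planning skip rate.
skip_differential = round(15.25 - 2.34, 2)

print(f"router params per block: {router_per_block}")
print(f"LoRA params (all blocks): {lora_total:,}")
print(f"trainable params total:  {trainable_total:,}")
print(f"share of backbone:       {100 * trainable_fraction:.2f}%")
print(f"skip differential:       {skip_differential} points")
```

Under these assumptions the routers contribute about 0.02M parameters and the LoRA adapters about 1.08M, which is consistent with the 1.10M total and the 0.22% share reported above.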
Similar Articles
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.
Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
Aletheia introduces a gradient-guided layer selection method for efficient LoRA fine-tuning that identifies task-relevant transformer layers via lightweight gradient probes and applies adapters selectively, achieving 15-28% training speedup across 14 models while maintaining downstream performance on MMLU, GSM8K, and HumanEval benchmarks.
Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training
Hybrid-LoRA proposes a framework that selectively applies full fine-tuning to a small subset of modules while using LoRA for the rest, achieving performance near full fine-tuning with significantly lower computational cost. Experiments show improvements of up to 5.65% over existing parameter-efficient baselines.
Parameter-Efficient Fine-Tuning with Learnable Rank
Researchers from Adelaide University introduce LR-LoRA (Learnable Rank LoRA), a parameter-efficient fine-tuning method that dynamically learns the adapter rank for each transformer layer during training rather than using a fixed global rank. LR-LoRA achieves state-of-the-art performance on language understanding and commonsense reasoning benchmarks, outperforming fixed-rank LoRA baselines.
Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection
The article introduces Echo-LoRA, a new parameter-efficient fine-tuning method that injects cross-layer representations from deeper source layers into shallow LoRA modules to improve performance without adding inference-time overhead.