LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

arXiv cs.AI 05/08/26, 04:00 AM Papers
Summary
This paper introduces LANTERN, a framework for multi-source neurosymbolic transfer in reinforcement learning that uses LLMs to generate task automata and adaptive gating to improve sample efficiency.
arXiv:2605.05478v1 Announce Type: new
# LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks
Source: [https://arxiv.org/html/2605.05478](https://arxiv.org/html/2605.05478)
Mahyar Alinejad¹ (mahyar.alinejad@ucf.edu), Yue Wang¹,² (yue.wang@ucf.edu), Amrit Singh Bedi² (amritbedi@ucf.edu), George Atia¹,² (george.atia@ucf.edu)

¹ Department of Electrical and Computer Engineering, Orlando, Florida, USA

² Department of Computer Science, University of Central Florida, Orlando, Florida, USA

###### Abstract

Transfer learning in reinforcement learning \(RL\) seeks to accelerate learning in new tasks by leveraging knowledge from related sources\. Existing neurosymbolic transfer methods, however, typically rely on manually specified task automata, assume a single source task, and use fixed knowledge\-integration mechanisms that cannot adapt to varying source relevance\. We propose LANTERN, a unified framework for multi\-source neurosymbolic transfer that addresses these limitations through three components: \(i\) deterministic finite automata generated from natural language task descriptions using large language models, \(ii\) semantic embedding\-based aggregation of multiple source policies weighted by cross\-task similarity, and \(iii\) adaptive teacher\-student gating based on temporal\-difference error and semantic uncertainty\. Across domains spanning resource management, navigation, and control, LANTERN achieves 40–60% improvements in sample efficiency over existing baselines while remaining robust to poorly aligned sources\. These results demonstrate that multi\-source, adaptively weighted neurosymbolic transfer can improve scalability and robustness in symbolic RL settings\.

###### keywords:

Reinforcement Learning, Transfer Learning, Neurosymbolic AI, Large Language Models, Automata Learning

## 1 Introduction

Reinforcement learning (RL) has achieved strong empirical performance in game playing (Mnih2015HumanLevel; Silver2016Go), robotics (Kober2013Reinforcement; levine2016end), and autonomous systems (Kiran2021Deep). However, effective policy learning often requires extensive interaction, limiting applicability when data collection is costly or unsafe (dulac2019challenges). Transfer learning mitigates this by leveraging knowledge from related tasks (Taylor2009Transfer; zhu2023transfer), yet existing approaches remain challenged by structured, long-horizon objectives that are naturally non-Markovian.

Neurosymbolic RL integrates symbolic task representations, such as deterministic finite automata (DFAs) or reward machines (ToroIcarte2018UsingRL; Icarte2022Reward), into learning. By encoding temporal structure through product MDP constructions (Bacchus1997NMRDP), these methods improve sample efficiency for complex tasks. Despite these advances, several limitations remain.

1\) Manual specification: Most methods assume expert-provided DFAs or temporal logic formulas (Littman2017Environment; Camacho2019LTL; Hahn2019Omega). Grammatical inference approaches can recover automata from demonstrations (Angluin1987Learning; Oncina1992; Alinejad2024Hybrid; Alinejad2026Dynamic), but they require structured trajectory data and are difficult to apply in sparse or exploratory RL settings.

2\) Single-source transfer: Automaton distillation transfers symbolic guidance via DFA transitions (Singireddy2023AutomatonDistillation; Alinejad2025NEUS), while policy distillation provides action-level knowledge (Rusu2015PolicyDistillation). CADENT combines both using experience-based gating (Alinejad2026Hybrid). However, these approaches rely on a single source task, which can limit effectiveness when source-target alignment varies.

3\) Fixed integration mechanisms: Existing methods typically employ predetermined weighting schemes (e.g., exponential decay (Singireddy2023AutomatonDistillation)) or static hyperparameters, limiting adaptability when source relevance changes across states or over time.

Key insight and technical novelty. We consider neurosymbolic transfer in a setting where multiple source tasks may have partially related but distinct goals from the target. In this regime, transfer cannot rely on direct reuse of a single source policy or automaton; instead, it requires semantic alignment and aggregation of structured knowledge across heterogeneous tasks.

To address this setting, we introduce LANTERN (LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks). LANTERN integrates three components. First, DFAs are generated from natural language task descriptions using large language models (LLMs), eliminating manual specification. Second, we construct a shared embedding space over automaton state descriptions, enabling aggregation of partial knowledge from multiple source tasks with heterogeneous goals. Third, we introduce a dual-volatility gating mechanism that combines semantic alignment (measured via embedding similarity) with experience-based reliability (measured via TD error), allowing adaptive weighting of teacher influence during learning.

Contributions. Our contributions are threefold:

1\) We formulate multi-source neurosymbolic transfer in a setting where source tasks may have heterogeneous goals, requiring semantic aggregation rather than direct reuse of a single source policy or automaton.

2\) We develop LANTERN, which integrates LLM\-based automaton generation, semantic multi\-source aggregation, and adaptive trust gating within a single neurosymbolic transfer architecture\.

3\) Across diverse domains, we demonstrate 40–60% improvements in sample efficiency over single\-source and static\-integration baselines, while maintaining robustness to poorly aligned sources\.

### 1.1 Related Work

Transfer learning in RL. Classical transfer methods include value function reuse (Taylor2007Cross), policy distillation (Rusu2015PolicyDistillation; Czarnecki2019Distilling), and successor features (Barreto2017Successo; Barreto2020Fast). Meta-learning (Finn2017Model; rakelly2019efficient) and multi-task learning (Parisotto2015ActorMimic; teh2017distral) share representations across related tasks. These approaches typically assume Markovian reward structures and do not explicitly model temporal logic or automaton-based task decomposition.

Neurosymbolic RL. To address non-Markovian objectives, reward machines (ToroIcarte2018UsingRL; Icarte2022Reward) and temporal logic specifications (Littman2017Environment; Camacho2019LTL; Hahn2019Omega) encode structured task progression via product MDP constructions. Extensions integrate automata with deep RL (Hasanbeig2020Deep; DeGiacomo2019Shielding) or infer specifications from demonstrations (VazquezChanlatte2018LearningSpecs). However, these works focus primarily on single-task learning rather than transfer across heterogeneous tasks.

Automaton-based transfer. Recent work leverages automaton structure for transfer. Automaton distillation (Singireddy2023AutomatonDistillation) transfers high-level task decomposition through DFA-guided Q-value aggregation. Bidirectional transfer frameworks (Alinejad2025NEUS) enable mutual knowledge exchange, while CADENT (Alinejad2026Hybrid) combines strategic automaton guidance with tactical policy distillation using experience-based gating. ARM-FM (Creus2024ARMFM) employs LLM-generated reward machines for transfer. These approaches, however, rely on single-source settings and fixed or experience-only integration mechanisms.

LLMs in RL. LLMs have been used to provide planning guidance (Jiang2019Language), programmatic policy representations (Verma2018Programmatically; andreasmodular), and zero-shot generalization signals (Oh2017Zero). In contrast to approaches that use language primarily for prompting or reward shaping, LANTERN generates formal DFAs compatible with product MDP constructions and integrates them into a multi-source neurosymbolic transfer framework.

The remainder of this paper is structured as follows: Section [2](https://arxiv.org/html/2605.05478#S2) provides necessary background on product MDPs and transfer learning. Section [3](https://arxiv.org/html/2605.05478#S3) details the LANTERN framework. Section [4](https://arxiv.org/html/2605.05478#S4) reports experimental results across four domains. Section [5](https://arxiv.org/html/2605.05478#S5) concludes with future directions.

## 2 Background

### 2.1 Markov Decision Processes and Q-Learning

A Markov Decision Process (MDP) (SuttonBarto) is a tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ is the transition function, $\mathcal{R}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. A policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ maps states to action distributions, where $\Delta(\mathcal{A})$ is the probability simplex on $\mathcal{A}$.

The goal is to find $\pi^{*}=\arg\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}(s_{t},a_{t})\mid s_{0}\right]$. The optimal action-value function $Q^{*}(s,a)=\max_{\pi}Q^{\pi}(s,a)$ satisfies:

$$Q^{*}(s,a)=\mathbb{E}_{s'}\left[\mathcal{R}(s,a)+\gamma\max_{a'}Q^{*}(s',a')\right]. \tag{1}$$

Q-learning (Watkins1992QLearning) iteratively estimates $Q^{*}$ via:

$$Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha\left[\mathcal{R}(s_{t},a_{t})+\gamma\max_{a'}Q(s_{t+1},a')-Q(s_{t},a_{t})\right], \tag{2}$$

where $\alpha\in(0,1]$ is the learning rate.
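As a minimal sketch of the update in Eq. (2), the snippet below applies one tabular Q-learning step. The function name `q_learning_update`, the toy states and actions, and the default values of `alpha` and `gamma` are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.6, gamma=0.95):
    """Apply one tabular Q-learning step (Eq. 2) in place and return the TD error."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical two-action toy problem; unseen entries default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 1.0
delta = q_learning_update(Q, "s0", "right", r=0.0, s_next="s1", actions=actions)
```

Here the TD error is $0 + 0.95 \cdot 1.0 - 0 = 0.95$, so the entry for `("s0", "right")` moves from 0 to $0.6 \cdot 0.95 = 0.57$.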

### 2.2 Product MDPs for Non-Markovian Objectives

Many tasks involve non-Markovian objectives depending on state history (Bacchus1997NMRDP). A DFA $\mathcal{D}=\langle\Omega,\Sigma,\delta,\omega_{0},F\rangle$ specifies such tasks, where $\Omega$ is the set of automaton states, $\Sigma$ is the set of labels, $\delta:\Omega\times\Sigma\to\Omega$ is the transition function, $\omega_{0}$ is the initial state, and $F\subseteq\Omega$ are accepting states. A labeling function $L:\mathcal{S}\to\Sigma$ maps MDP states to labels.

The product MDP $\mathcal{M}\times\mathcal{D}=\langle\mathcal{S}\times\Omega,\mathcal{A},\mathcal{T}',\mathcal{R}',\gamma\rangle$ has state space $\mathcal{S}\times\Omega$, with $(s,\omega)$ representing the agent at MDP state $s$ with the automaton in state $\omega$. The transition function is $\mathcal{T}'((s,\omega),a,(s',\omega'))=\mathcal{T}(s,a,s')$ if $\omega'=\delta(\omega,L(s'))$, and $0$ otherwise. The reward function $\mathcal{R}'((s,\omega),a)$ is designed based on automaton progress, typically providing sparse rewards when reaching accepting states ($\omega'\in F$) or incremental rewards when making automaton transitions ($\omega\neq\omega'$). This transforms non-Markovian objectives into standard MDP learning by tracking task progress through $\omega$ (ToroIcarte2018UsingRL; Icarte2022Reward).
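The product construction can be illustrated with a small sketch that advances the automaton component alongside the base MDP. The `DFA` class, the `product_step` helper, the self-loop default for unlisted labels, and the reward of 1.0 on first entering an accepting state are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    states: frozenset
    alphabet: frozenset
    delta: dict          # maps (omega, label) -> omega'
    initial: str
    accepting: frozenset

    def step(self, omega, label):
        # Assumption: labels with no listed transition leave omega unchanged.
        return self.delta.get((omega, label), omega)


def product_step(dfa, labeling, omega, s_next):
    """Advance the automaton component after the base MDP moved to s_next."""
    omega_next = dfa.step(omega, labeling(s_next))
    # Assumed sparse reward: 1.0 when an accepting state is first entered.
    entered_goal = omega_next in dfa.accepting and omega not in dfa.accepting
    return omega_next, (1.0 if entered_goal else 0.0)


# Hypothetical task: pick up a key, then reach a door.
dfa = DFA(
    states=frozenset({"w0", "w1", "w2"}),
    alphabet=frozenset({"key", "door", "none"}),
    delta={("w0", "key"): "w1", ("w1", "door"): "w2"},
    initial="w0",
    accepting=frozenset({"w2"}),
)
cell_labels = {(0, 1): "key", (2, 2): "door"}
labeling = lambda s: cell_labels.get(s, "none")

omega = dfa.initial
trace = []
for s_next in [(2, 2), (0, 1), (2, 2)]:
    omega, reward = product_step(dfa, labeling, omega, s_next)
    trace.append((omega, reward))
```

Visiting the door before holding the key leaves the automaton in `w0`, which is how the product state makes the history-dependent objective Markovian.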

### 2.3 Transfer Learning and Neurosymbolic Methods

Transfer learning accelerates target task learning by leveraging source knowledge (Taylor2009Transfer). Policy distillation (Rusu2015PolicyDistillation) trains a student $\pi^{\text{student}}$ to mimic a teacher $\pi^{\text{teacher}}$ by minimizing $D_{\text{KL}}(\pi^{\text{teacher}}(\cdot|s)\,\|\,\pi^{\text{student}}(\cdot|s))$.

Automaton distillation (Singireddy2023AutomatonDistillation; Alinejad2025NEUS) transfers strategic knowledge via DFA transitions. Given a teacher Q-function, $Q^{\text{teacher}}$, learned on a source product MDP, the method computes aggregated Q-values for each automaton transition $(\omega,\omega')$:

$$Q_{\text{AD}}(\omega,\omega')=\frac{1}{|\mathcal{S}_{\omega\to\omega'}|}\sum_{(s,a)\in\mathcal{S}_{\omega\to\omega'}}Q^{\text{teacher}}((s,\omega),a), \tag{3}$$

where $\mathcal{S}_{\omega\to\omega'}=\{(s,a):\delta(\omega,L(s'))=\omega',\ s'\sim\mathcal{T}(s,a,\cdot)\}$ is the set of state-action pairs that trigger the automaton transition from $\omega$ to $\omega'$, and $Q^{\text{teacher}}((s,\omega),a)$ is the teacher's learned action-value function on the product MDP. During target task learning, the student receives additional reward $\lambda_{\text{AD}}\cdot Q_{\text{AD}}(\omega,\omega')$ when making automaton transitions.
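The averaging in Eq. (3) can be sketched as follows. The function name `automaton_distill` and the input layout (a dictionary of teacher Q-values plus a precomputed list of triggering state-action pairs) are assumptions; the paper does not specify how the set $\mathcal{S}_{\omega\to\omega'}$ is collected.

```python
from collections import defaultdict


def automaton_distill(q_teacher, triggers):
    """Average teacher Q-values per automaton transition (Eq. 3).

    q_teacher: dict mapping ((s, omega), a) -> Q-value.
    triggers:  iterable of (s, a, omega, omega_next) tuples, one per state-action
               pair that triggers omega -> omega_next (assumed precomputed).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, a, omega, omega_next in triggers:
        sums[(omega, omega_next)] += q_teacher[((s, omega), a)]
        counts[(omega, omega_next)] += 1
    return {edge: sums[edge] / counts[edge] for edge in sums}


# Hypothetical teacher values for two automaton transitions.
q_teacher = {
    (("a", "w0"), "up"): 2.0,
    (("b", "w0"), "up"): 4.0,
    (("c", "w1"), "down"): 5.0,
}
triggers = [("a", "up", "w0", "w1"), ("b", "up", "w0", "w1"), ("c", "down", "w1", "w2")]
q_ad = automaton_distill(q_teacher, triggers)
```

The result keeps one scalar per automaton edge, which is what makes the knowledge independent of the source environment's low-level state space.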

CADENT combines strategic and tactical guidance with experience-based gating (Alinejad2026Hybrid). It tracks temporal-difference (TD) error volatility (a measure of learning instability) for each state-action pair:

$$V_{t}(s,a)\leftarrow(1-\eta)V_{t-1}(s,a)+\eta|\delta_{t}(s,a)|,$$

where $\delta_{t}(s,a)=r_{t}+\gamma\max_{a'}Q_{t}(s',a')-Q_{t}(s,a)$ is the TD error at time $t$, and $\eta\in(0,1)$ is a smoothing parameter. A trust gate measuring confidence in the student's own estimate is computed as

$$\tau(s,a)=\sigma\!\left(-k\big(V(s,a)-\theta\big)\right),$$

where $\sigma(x)=1/(1+e^{-x})$ is the sigmoid function, $k>0$ controls gate sharpness, and $\theta\in(0,1)$ is a threshold. The Q-update balances student learning and teacher guidance:

$$\Delta Q(s,a)=\alpha\left[\tau(s,a)\,\delta_{t}(s,a)+(1-\tau(s,a))\,G_{\text{teacher}}(s,a)\right], \tag{4}$$

where

$$G_{\text{teacher}}(s,a)=\lambda_{\text{AD}}r_{\text{AD}}(\omega,\omega')+\lambda_{\text{PD}}\big(\pi^{\text{teacher}}(a|s)-\pi^{\text{student}}(a|s)\big)$$

combines strategic guidance (intrinsic reward $r_{\text{AD}}$ for automaton transitions $\omega\to\omega'$) and tactical guidance (policy discrepancy), weighted by $\lambda_{\text{AD}},\lambda_{\text{PD}}\geq 0$.
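The three CADENT quantities above (volatility, trust gate, gated update) can be written as a short sketch. The function names and the default values of `eta`, `k`, `theta`, and `alpha` are illustrative assumptions rather than settings reported for CADENT.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_volatility(v_prev, td_error, eta=0.01):
    """Exponential moving average of the absolute TD error."""
    return (1.0 - eta) * v_prev + eta * abs(td_error)


def trust_gate(volatility, k=5.0, theta=0.5):
    """tau = sigma(-k (V - theta)); tau near 1 weights the student's TD error."""
    return sigmoid(-k * (volatility - theta))


def cadent_delta_q(td_error, g_teacher, tau, alpha=0.6):
    """Eq. 4: blend the TD error with the teacher signal."""
    return alpha * (tau * td_error + (1.0 - tau) * g_teacher)


v = update_volatility(0.0, td_error=2.0)          # 0.01 * 2.0
tau_at_threshold = trust_gate(0.5)                # V == theta gives sigma(0)
dq = cadent_delta_q(td_error=1.0, g_teacher=0.2, tau=tau_at_threshold)
```

With $V=\theta$ the gate sits at 0.5, so the update is $0.6\,(0.5\cdot 1.0+0.5\cdot 0.2)=0.36$; lower volatility pushes the gate toward 1 and the update toward a plain TD step.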

ARM-FM (Creus2024ARMFM) generates reward machines via LLMs with single-source embedding transfer. However, all of these methods rely on a single source and lack graceful degradation under misalignment.

## 3 Formulation and Proposed LANTERN Framework

### 3.1 Problem Formulation

Consider $K$ source tasks, each modeled as a product MDP $\mathcal{M}_{k}^{\text{src}}\times\mathcal{D}_{k}^{\text{src}}$, where the automaton $\mathcal{D}_{k}^{\text{src}}=\langle\Omega_{k}^{\text{src}},\Sigma_{k},\delta_{k}^{\text{src}},\omega_{0,k}^{\text{src}},F_{k}^{\text{src}}\rangle$ encodes the task structure. For each source task $k$, we assume access to: (i) a learned Q-function $Q_{k}^{\text{teacher}}$ defined on the product MDP, (ii) distilled strategic knowledge $Q_{k,\text{AD}}$ (Eq. [3](https://arxiv.org/html/2605.05478#S2.E3)), (iii) a distilled tactical policy $\pi_{k}^{\text{teacher}}$, and (iv) semantic descriptions $\mathrm{desc}_{k}:\Omega_{k}^{\text{src}}\to\mathcal{V}$ mapping each automaton state to a natural language description in vocabulary space $\mathcal{V}$.

The target task is specified only by a natural language description $\mathcal{T}_{\text{desc}}\in\mathcal{V}$ and a base MDP $\mathcal{M}^{\text{tgt}}$, without a manually provided automaton. Our objective is to learn a target policy $\pi^{\text{tgt}}$ that maximizes expected return on the induced product MDP while leveraging the multi-source knowledge set

$$\mathcal{K}=\{(Q_{k}^{\text{teacher}},Q_{k,\text{AD}},\pi_{k}^{\text{teacher}},\mathrm{desc}_{k})\}_{k=1}^{K},$$

under the setting where source tasks may have heterogeneous goals and distinct automaton structures. The automaton state spaces $\Omega_{k}^{\text{src}}$ are not aligned across tasks, and source goals may differ from the target goal, precluding direct reuse of symbolic states or value functions.

### 3.2 Phase 1: LLM-Enhanced Automaton Generation

Given a natural language task description $\mathcal{T}_{\text{desc}}\in\mathcal{V}$, we use an LLM $\mathcal{L}$ to generate a target DFA $\mathcal{D}^{\text{tgt}}=\langle\Omega^{\text{tgt}},\Sigma^{\text{tgt}},\delta^{\text{tgt}},\omega_{0}^{\text{tgt}},F^{\text{tgt}}\rangle$, together with semantic state descriptions $\mathrm{desc}^{\text{tgt}}:\Omega^{\text{tgt}}\to\mathcal{V}$ that assign each automaton state a natural language description.

#### Prompt construction.

We design a structured prompt $\mathcal{P}(\mathcal{T}_{\text{desc}})$ that instructs $\mathcal{L}$ to: (i) extract key subgoals and temporal dependencies from $\mathcal{T}_{\text{desc}}$, (ii) define automaton states $\Omega^{\text{tgt}}$ representing task-progress milestones, (iii) specify a deterministic transition function $\delta^{\text{tgt}}:\Omega^{\text{tgt}}\times\Sigma^{\text{tgt}}\to\Omega^{\text{tgt}}$, (iv) designate the initial state $\omega_{0}^{\text{tgt}}$ and accepting states $F^{\text{tgt}}$, and (v) provide a semantic description $\mathrm{desc}^{\text{tgt}}(\omega)$ for each state $\omega\in\Omega^{\text{tgt}}$.

#### Example.

Given $\mathcal{T}_{\text{desc}}=$ “Navigate dungeon to collect key and shield, then open chest for sword, finally defeat dragon,” the LLM generates a DFA with:

- States: $\Omega^{\text{tgt}}=\{\omega_{0},\omega_{1},\omega_{2},\omega_{3},\omega_{4}\}$
- Descriptions: $\mathrm{desc}^{\text{tgt}}(\omega_{0})=$ “start mission”, $\mathrm{desc}^{\text{tgt}}(\omega_{1})=$ “collect key”, $\mathrm{desc}^{\text{tgt}}(\omega_{2})=$ “collect shield”, $\mathrm{desc}^{\text{tgt}}(\omega_{3})=$ “obtain sword from chest”, $\mathrm{desc}^{\text{tgt}}(\omega_{4})=$ “defeat dragon (goal)”
- Transitions: $\omega_{0}\xrightarrow{\text{key}}\omega_{1}\xrightarrow{\text{shield}}\omega_{2}\xrightarrow{\text{sword}}\omega_{3}\xrightarrow{\text{dragon}}\omega_{4}$.
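Because the DFA comes from an LLM rather than an expert, it is natural to check the output before building the product MDP. The sketch below parses the dungeon example from a JSON string and verifies determinism and description coverage. The JSON schema, the state names `w0`..`w4`, and the `parse_dfa` helper are hypothetical; the paper does not state the output format it requests from the LLM.

```python
import json

# Hypothetical JSON an LLM might return for the dungeon task (schema is assumed).
llm_output = """
{
  "states": ["w0", "w1", "w2", "w3", "w4"],
  "initial": "w0",
  "accepting": ["w4"],
  "transitions": [
    {"from": "w0", "label": "key", "to": "w1"},
    {"from": "w1", "label": "shield", "to": "w2"},
    {"from": "w2", "label": "sword", "to": "w3"},
    {"from": "w3", "label": "dragon", "to": "w4"}
  ],
  "descriptions": {
    "w0": "start mission", "w1": "collect key", "w2": "collect shield",
    "w3": "obtain sword from chest", "w4": "defeat dragon (goal)"
  }
}
"""


def parse_dfa(text):
    """Parse and sanity-check a DFA spec; raises ValueError on malformed output."""
    spec = json.loads(text)
    states = set(spec["states"])
    if spec["initial"] not in states or not set(spec["accepting"]) <= states:
        raise ValueError("initial/accepting states must be declared states")
    delta = {}
    for t in spec["transitions"]:
        key = (t["from"], t["label"])
        if t["from"] not in states or t["to"] not in states:
            raise ValueError(f"unknown state in transition {t}")
        if key in delta:  # a repeated (state, label) pair breaks determinism
            raise ValueError(f"nondeterministic transition on {key}")
        delta[key] = t["to"]
    if set(spec["descriptions"]) != states:
        raise ValueError("every state needs exactly one description")
    return spec["initial"], set(spec["accepting"]), delta, spec["descriptions"]


initial, accepting, delta, desc = parse_dfa(llm_output)

# Replay the label sequence from the example; unlisted labels would self-loop.
omega = initial
for label in ["key", "shield", "sword", "dragon"]:
    omega = delta.get((omega, label), omega)
```

Replaying the intended label sequence ends in the accepting state, which gives a cheap smoke test that the generated automaton encodes the described ordering.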

#### Product MDP construction.

A labeling function $L^{\text{tgt}}:\mathcal{S}^{\text{tgt}}\to\Sigma^{\text{tgt}}$ maps environment states to automaton symbols based on observable conditions (e.g., item collection or goal completion events). Given the base MDP $\mathcal{M}^{\text{tgt}}$ and the LLM-generated DFA $\mathcal{D}^{\text{tgt}}$, we construct the product MDP $\mathcal{M}^{\text{tgt}}\times\mathcal{D}^{\text{tgt}}$ with augmented state space $\mathcal{S}^{\text{tgt}}\times\Omega^{\text{tgt}}$, following the standard construction described in Section [2](https://arxiv.org/html/2605.05478#S2). While prior work has used LLMs to generate reward machines or automata (Creus2024ARMFM), our use of semantic descriptions extends beyond specification generation. The descriptions $\mathrm{desc}^{\text{tgt}}$ define a semantic representation of automaton states that will later support multi-source knowledge aggregation and adaptive teacher-student gating within the LANTERN framework.

### 3.3 Phase 2: Semantic Embedding and Neighborhood Construction

To enable transfer across heterogeneous source tasks, we construct a shared semantic embedding space over automaton state descriptions. Unlike single-source approaches (Creus2024ARMFM; Alinejad2026Hybrid), this embedding allows alignment and aggregation of symbolic states originating from different task goals.

For each automaton state $\omega$ with description $\mathrm{desc}(\omega)$, we compute an embedding

$$\phi(\omega)=\mathcal{E}(\mathrm{desc}(\omega))\in\mathbb{R}^{d}, \tag{5}$$

where $\mathcal{E}:\mathcal{V}\to\mathbb{R}^{d}$ is a fixed text-embedding model (e.g., sentence-BERT), and $d$ is the embedding dimension. The same embedding function is used for both source and target automata, yielding a shared semantic space.

#### Cross-task similarity.

Given a target state $\omega^{\text{tgt}}\in\Omega^{\text{tgt}}$ and a source state $\omega_{k}^{\text{src}}\in\Omega_{k}^{\text{src}}$, we define semantic similarity via cosine similarity

$$\mathrm{sim}(\omega^{\text{tgt}},\omega_{k}^{\text{src}})=\frac{\phi(\omega^{\text{tgt}})^{\top}\phi(\omega_{k}^{\text{src}})}{\|\phi(\omega^{\text{tgt}})\|\,\|\phi(\omega_{k}^{\text{src}})\|}. \tag{6}$$

High similarity indicates alignment in task-progress semantics, even when tasks differ in domain.

#### Semantic neighborhoods.

For each target state $\omega^{\text{tgt}}\in\Omega^{\text{tgt}}$, we consider all source automaton states across the $K$ tasks and compute their semantic similarity to $\omega^{\text{tgt}}$. The semantic neighborhood $\mathcal{N}_{M}(\omega^{\text{tgt}})$ is defined as the set of the $M$ source states with the largest similarity values:

$$\mathcal{N}_{M}(\omega^{\text{tgt}})=\left\{(\omega_{k}^{\text{src}},k):\omega_{k}^{\text{src}}\in\Omega_{k}^{\text{src}},\ \text{ranked among the top-}M\text{ by }\mathrm{sim}(\omega^{\text{tgt}},\omega_{k}^{\text{src}})\right\}. \tag{7}$$

For each $(\omega_{k}^{\text{src}},k)\in\mathcal{N}_{M}(\omega^{\text{tgt}})$, we define normalized aggregation weights

$$w(\omega^{\text{tgt}},\omega_{k}^{\text{src}})=\frac{\max\{\mathrm{sim}(\omega^{\text{tgt}},\omega_{k}^{\text{src}}),0\}}{\sum_{(\omega_{j}^{\text{src}},j)\in\mathcal{N}_{M}(\omega^{\text{tgt}})}\max\{\mathrm{sim}(\omega^{\text{tgt}},\omega_{j}^{\text{src}}),0\}}, \tag{8}$$

so that more semantically aligned states contribute more strongly to subsequent guidance.
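Eqs. (6) to (8) can be sketched in a few lines of plain Python. The toy 2-D vectors below stand in for real sentence embeddings, the source-state names are illustrative, and the zero-weight fallback when no neighbor has positive similarity is an assumption (the paper does not discuss that case).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def neighborhood_weights(target_vec, source_vecs, M=3):
    """Return {(state, k): weight} for the top-M source states (Eqs. 6-8).

    source_vecs maps (source_state, task_index) -> embedding vector.
    """
    sims = {key: cosine(target_vec, vec) for key, vec in source_vecs.items()}
    top = sorted(sims, key=sims.get, reverse=True)[:M]
    clipped = {key: max(sims[key], 0.0) for key in top}
    total = sum(clipped.values())
    if total == 0.0:
        # Assumption: with no positively aligned neighbor, give every weight 0.
        return {key: 0.0 for key in top}
    return {key: value / total for key, value in clipped.items()}


# Toy embeddings (hypothetical); real ones would come from a text-embedding model.
target = [1.0, 0.0]
sources = {
    ("find_map", 1): [1.0, 0.0],      # sim = 1.0
    ("find_clue", 2): [0.6, 0.8],     # sim = 0.6
    ("return_base", 1): [0.0, 1.0],   # sim = 0.0
    ("dig_treasure", 2): [-1.0, 0.0], # sim = -1.0, outside the top 3
}
w = neighborhood_weights(target, sources, M=3)
```

Clipping at zero means an orthogonal or opposed neighbor can sit inside the top-$M$ set without contributing anything to the aggregate.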

### 3.4 Phase 3: Multi-Source Knowledge Aggregation

LANTERN aggregates both strategic (automaton-level) and tactical (policy-level) guidance from the semantic neighborhood of each target automaton state.

#### Strategic guidance aggregation.

For a target automaton state $\omega^{\text{tgt}}\in\Omega^{\text{tgt}}$, we aggregate strategic guidance from semantically aligned source states. Let $Q_{k,\text{AD}}(\omega_{k}^{\text{src}},\omega_{k}^{\prime\,\text{src}})$ denote the automaton-distilled Q-value in source task $k$ for the transition $\omega_{k}^{\text{src}}\to\omega_{k}^{\prime\,\text{src}}$. We define the aggregated strategic value as

$$Q_{\text{AD}}^{\text{agg}}(\omega^{\text{tgt}})=\sum_{(\omega_{k}^{\text{src}},k)\in\mathcal{N}_{M}(\omega^{\text{tgt}})}w(\omega^{\text{tgt}},\omega_{k}^{\text{src}})\,Q_{k,\text{AD}}(\omega_{k}^{\text{src}}), \tag{9}$$

where $Q_{k,\text{AD}}(\omega_{k}^{\text{src}})$ summarizes the strategic value of progressing from $\omega_{k}^{\text{src}}$ in source $k$ (e.g., expected intrinsic reward over outgoing transitions). This yields a convex combination of high-level task progression signals across sources.

#### Tactical guidance aggregation.

At the action level, we aggregate teacher policies from aligned source states. Let $\pi_{k}^{\text{teacher}}(a\mid s_{k},\omega_{k}^{\text{src}})$ denote the teacher policy in source $k$ defined over product states. For a target state $(s,\omega^{\text{tgt}})$, we define

$$\pi_{\text{teacher}}^{\text{agg}}(a\mid s,\omega^{\text{tgt}})=\sum_{(\omega_{k}^{\text{src}},k)\in\mathcal{N}_{M}(\omega^{\text{tgt}})}w(\omega^{\text{tgt}},\omega_{k}^{\text{src}})\,\pi_{k}^{\text{teacher}}(a\mid s_{k},\omega_{k}^{\text{src}}), \tag{10}$$

where $s_{k}$ denotes the mapped source-state context corresponding to the target state $s$.
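Both aggregations in Eqs. (9) and (10) are weighted sums over the same neighborhood, as the sketch below shows. It assumes the teacher distributions have already been evaluated at the mapped source context $s_k$ and that all tasks share one action set; the function names and numeric values are hypothetical.

```python
def aggregate_strategic(weights, q_ad):
    """Eq. 9: weighted sum of per-state automaton-distilled values."""
    return sum(w * q_ad[key] for key, w in weights.items())


def aggregate_tactical(weights, teacher_probs, actions):
    """Eq. 10: weighted mixture of teacher action distributions."""
    return {
        a: sum(w * teacher_probs[key][a] for key, w in weights.items())
        for a in actions
    }


# Hypothetical neighborhood of two source states with normalized weights.
weights = {("find_map", 1): 0.625, ("find_clue", 2): 0.375}
q_ad = {("find_map", 1): 2.0, ("find_clue", 2): 4.0}
teacher_probs = {
    ("find_map", 1): {"up": 0.8, "down": 0.2},
    ("find_clue", 2): {"up": 0.4, "down": 0.6},
}

q_agg = aggregate_strategic(weights, q_ad)
pi_agg = aggregate_tactical(weights, teacher_probs, ["up", "down"])
```

Because the weights sum to one, the tactical aggregate remains a valid probability distribution whenever each teacher distribution is one.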

### 3.5 Phase 4: Dual-Volatility Experience Gating

LANTERN combines experience-based and semantic uncertainty to adaptively balance student and teacher updates.

#### Experience volatility.

We track TD-error volatility:

$$V_{t}^{\text{exp}}(s,a)\leftarrow(1-\eta)V_{t-1}^{\text{exp}}(s,a)+\eta|\delta_{t}(s,a)|, \tag{11}$$

where $\eta\in(0,1)$. High volatility indicates unstable learning (favoring teacher guidance), while low volatility indicates convergence.

#### Semantic volatility.

We define

$$V^{\text{sem}}(\omega^{\text{tgt}})=1-\max_{k,\;\omega_{k}^{\text{src}}\in\Omega_{k}^{\text{src}}}\mathrm{sim}(\omega^{\text{tgt}},\omega_{k}^{\text{src}}), \tag{12}$$

so small values correspond to well-aligned source states and large values to misalignment.

#### Composite trust gate.

We convert both volatility measures into trust coefficients as

$$\tau_{\text{exp}}(s,a)=\sigma\!\left(-k_{\text{exp}}(V^{\text{exp}}(s,a)-\theta_{\text{exp}})\right), \tag{13}$$

$$\tau_{\text{sem}}(\omega^{\text{tgt}})=\sigma\!\left(-k_{\text{sem}}(V^{\text{sem}}(\omega^{\text{tgt}})-\theta_{\text{sem}})\right), \tag{14}$$

$$\tau(s,\omega^{\text{tgt}},a)=\tau_{\text{exp}}(s,a)\,\tau_{\text{sem}}(\omega^{\text{tgt}}), \tag{15}$$

where $\sigma(x)=1/(1+e^{-x})$, $k_{\text{exp}},k_{\text{sem}}>0$ control sharpness, and $\theta_{\text{exp}},\theta_{\text{sem}}\in(0,1)$ are volatility thresholds. The multiplicative form ensures teacher influence is strong only when learning is unstable and the sources are semantically aligned.
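A literal transcription of Eqs. (12) to (15) is sketched below. The function names are assumptions; the default sharpness and threshold values are the ones listed in the implementation details of Section 4.1, and the similarity list is made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_volatility(sims):
    """Eq. 12: one minus the best similarity over all source states."""
    return 1.0 - max(sims)


def composite_gate(v_exp, v_sem, k_exp=5.0, k_sem=5.0, theta_exp=0.5, theta_sem=0.3):
    """Eqs. 13-15: product of the experience and semantic trust coefficients."""
    tau_exp = sigmoid(-k_exp * (v_exp - theta_exp))
    tau_sem = sigmoid(-k_sem * (v_sem - theta_sem))
    return tau_exp * tau_sem, tau_exp, tau_sem


# Hypothetical similarities of one target state to three source states.
v_sem = semantic_volatility([0.89, 0.41, 0.12])

# Both volatilities exactly at their thresholds: each coefficient is sigma(0).
tau, tau_exp, tau_sem = composite_gate(v_exp=0.5, v_sem=0.3)
```

Since each coefficient lies in $(0,1)$, the product is never larger than either factor, so the composite gate is at most as large as the smaller of the two.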

### 3\.6Phase 5: LANTERN Learning Update

The student performs Q\-learning on the product MDPℳtgt×𝒟tgt\\mathcal\{M\}^\{\\text\{tgt\}\}\\times\\mathcal\{D\}^\{\\text\{tgt\}\}with integrated multi\-source guidance\.

#### Unified update rule\.

At timesteptt, after observing\(st,ωt,at,rt,st\+1,ωt\+1\)\(s\_\{t\},\\omega\_\{t\},a\_\{t\},r\_\{t\},s\_\{t\+1\},\\omega\_\{t\+1\}\)with\(st,ωt\)∈𝒮tgt×Ωtgt\(s\_\{t\},\\omega\_\{t\}\)\\in\\mathcal\{S\}^\{\\text\{tgt\}\}\\times\\Omega^\{\\text\{tgt\}\}:

ΔQ\(\(st,ωt\),at\)=α\[τ\(st,ωt,at\)δt\+\(1−τ\(st,ωt,at\)\)Gmulti\],\\Delta Q\(\(s\_\{t\},\\omega\_\{t\}\),a\_\{t\}\)=\\alpha\\\!\\left\[\\tau\(s\_\{t\},\\omega\_\{t\},a\_\{t\}\)\\,\\delta\_\{t\}\+\\big\(1\-\\tau\(s\_\{t\},\\omega\_\{t\},a\_\{t\}\)\\big\)\\,G\_\{\\text\{multi\}\}\\right\],\(16\)where

δt=rt\+γmaxa′⁡Q\(\(st\+1,ωt\+1\),a′\)−Q\(\(st,ωt\),at\)\.\\delta\_\{t\}=r\_\{t\}\+\\gamma\\max\_\{a^\{\\prime\}\}Q\(\(s\_\{t\+1\},\\omega\_\{t\+1\}\),a^\{\\prime\}\)\-Q\(\(s\_\{t\},\\omega\_\{t\}\),a\_\{t\}\)\.
The aggregated guidance combines strategic and tactical components:

$$G_{\text{multi}}=\lambda_{\text{AD}}\,r_{\text{AD}}^{\text{agg}}(\omega_t,\omega_{t+1})+\lambda_{\text{PD}}\,g_{\text{PD}}^{\text{agg}}(\omega_t,a_t),\tag{17}$$

where the strategic component is

$$r_{\text{AD}}^{\text{agg}}(\omega_t,\omega_{t+1})=\begin{cases}Q_{\text{AD}}^{\text{agg}}(\omega_t),&\omega_t\neq\omega_{t+1},\\ 0,&\text{otherwise},\end{cases}$$

and the tactical component is

$$g_{\text{PD}}^{\text{agg}}(\omega_t,a_t)=\pi_{\text{teacher}}^{\text{agg}}(a_t\mid s_t,\omega_t)-\pi_{\text{student}}(a_t\mid(s_t,\omega_t)),\quad \pi_{\text{student}}(a\mid(s,\omega))=\mathrm{softmax}(Q((s,\omega),\cdot)).$$

When $\tau\to 1$, learning reduces to standard TD updates; when $\tau\to 0$, updates rely primarily on aggregated teacher guidance, enabling graceful degradation under source misalignment.
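The unified update can be sketched in a few lines of tabular Python. This is a minimal illustration under stated assumptions, not the authors' code: the gate value `tau`, the aggregated automaton value `q_ad_agg` (standing in for $Q_{\text{AD}}^{\text{agg}}(\omega_t)$), and the aggregated teacher distribution `pi_teacher` are defined earlier in the paper and are simply passed in as inputs here. The function and helper names are hypothetical, and the defaults reuse the hyperparameters reported in the Implementation paragraph.

```python
import math
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

QTable = DefaultDict[Tuple[int, int], List[float]]


def make_q_table(n_actions: int) -> QTable:
    """Q-table keyed by product state (s, omega), zero-initialised."""
    return defaultdict(lambda: [0.0] * n_actions)


def softmax(values: Sequence[float]) -> List[float]:
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_update(Q: QTable, s: int, omega: int, a: int, r: float,
                 s_next: int, omega_next: int,
                 tau: float, q_ad_agg: float, pi_teacher: Sequence[float],
                 alpha: float = 0.6, gamma: float = 0.95,
                 lam_ad: float = 0.15, lam_pd: float = 0.7) -> float:
    """Apply one gated update to Q[(s, omega)][a] and return the increment."""
    # TD error on the product state.
    td = r + gamma * max(Q[(s_next, omega_next)]) - Q[(s, omega)][a]
    # Strategic term: nonzero only when the automaton state changes.
    r_ad = q_ad_agg if omega != omega_next else 0.0
    # Tactical term: teacher probability minus student softmax probability.
    pi_student = softmax(Q[(s, omega)])
    g_pd = pi_teacher[a] - pi_student[a]
    # Aggregated guidance (Eq. 17), then the gated mixture (Eq. 16).
    g_multi = lam_ad * r_ad + lam_pd * g_pd
    delta_q = alpha * (tau * td + (1.0 - tau) * g_multi)
    Q[(s, omega)][a] += delta_q
    return delta_q


Q = make_q_table(n_actions=2)
step = gated_update(Q, s=0, omega=0, a=1, r=1.0, s_next=1, omega_next=1,
                    tau=0.5, q_ad_agg=2.0, pi_teacher=[0.1, 0.9])
print(step, Q[(0, 0)])
```

Setting `tau=1.0` in this sketch recovers a plain Q-learning step, while `tau=0.0` drives the update entirely by the guidance term, mirroring the two limits described above.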

## 4 Experimental Evaluation

We evaluate LANTERN to answer: \(1\) Does LANTERN achieve superior sample efficiency vs\. baselines? \(2\) How do components contribute to performance? \(3\) Does LANTERN maintain robustness under poor source alignment?

### 4.1 Experimental Setup

#### Environments\.

We evaluate on two domains with distinct task structures:

Dungeon Quest (20×20 navigation): Sequential collection of key, shield, chest → sword, dragon defeat with strict temporal ordering (Alinejad2025NEUS).

Blind Craftsman (25×25 resource management): Multiple gather → craft → deliver cycles with inventory constraints (wood capacity: 2, product capacity: 3) (Alinejad2025NEUS).

#### Multi\-source knowledge\.

We construct source bases where *individual sources have different goals* than the targets, testing partial knowledge aggregation:

Dungeon Quest Sources: (1) *Rescue Mission* (5×5): Find map → locate victim → get medkit → return to base. (2) *Treasure Hunt* (6×6): Find clue → decode → get shovel → dig treasure. Neither source solves combat or multi-item states, yet LANTERN leverages complementary sequential knowledge (e.g., “gather\_key” has sim=0.89 with “find\_map”).

Blind Craftsman Sources: (1) *Mining Operation* (7×7, 16-state DFA): Collect ore → smelt ingots → deliver to depot. (2) *Farming Operation* (8×8, 8-state DFA): Plant seeds → harvest crops → deliver to market. The constraints are identical but the semantics differ (e.g., “craft\_product” has sim=0.87 with “smelt\_ore”).

#### Implementation\.

Tabular Q-learning on product MDPs. Teachers train 500-600 episodes; students train 2000 (DQ) and 1000 (BC) episodes with max 1500 and 2500 steps. $\alpha\in[0.6,0.7]$, $\gamma=0.95$, $\epsilon$-greedy decay 0.9992-0.9997. LANTERN: $M=3$, $\eta=0.01$, $k_{\text{exp}}=k_{\text{sem}}=5.0$, $\theta_{\text{exp}}=0.5$, $\theta_{\text{sem}}=0.3$, $\lambda_{\text{AD}}=0.15$, $\lambda_{\text{PD}}=0.7$. Results averaged over 5 seeds.
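For reproduction purposes, the reported settings can be gathered into a single configuration object. The sketch below is only one possible arrangement: the class and field names are hypothetical, the defaults pick the lower end of each reported range, the pairing of step limits with domains (1500 for DQ, 2500 for BC) is a reading of the sentence above, and the multiplicative per-episode $\epsilon$ decay starting from 1.0 is an assumption that the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanternConfig:
    # Q-learning settings reported in the text.
    alpha: float = 0.6             # reported range: [0.6, 0.7]
    gamma: float = 0.95
    epsilon_decay: float = 0.9992  # reported range: 0.9992-0.9997
    # LANTERN-specific settings, copied from the text (symbols defined earlier in the paper).
    M: int = 3
    eta: float = 0.01
    k_exp: float = 5.0
    k_sem: float = 5.0
    theta_exp: float = 0.5
    theta_sem: float = 0.3
    lambda_ad: float = 0.15
    lambda_pd: float = 0.7
    # Student training budget; the per-domain pairing below is an assumption.
    student_episodes: int = 2000
    max_steps: int = 1500
    seeds: int = 5


DUNGEON_QUEST = LanternConfig(student_episodes=2000, max_steps=1500)
BLIND_CRAFTSMAN = LanternConfig(student_episodes=1000, max_steps=2500)


def epsilon_after(cfg: LanternConfig, episodes: int, epsilon_start: float = 1.0) -> float:
    """Exploration rate after `episodes` decays (assumed multiplicative, once per episode)."""
    return epsilon_start * cfg.epsilon_decay ** episodes


print(epsilon_after(DUNGEON_QUEST, DUNGEON_QUEST.student_episodes))
```

Freezing the dataclass keeps a run's hyperparameters immutable, which makes it easier to log exactly what each of the 5 seeds used.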

#### Baselines\.

We compare LANTERN against four baselines: No Transfer uses standard Q-learning with LLM-generated DFA but no source knowledge transfer. Automaton Distillation (AD) (Singireddy2023AutomatonDistillation; Alinejad2025NEUS) transfers strategic guidance from a single source task using automaton transition values with exponential decay weighting ($\rho=0.99$). CADENT (Alinejad2026Hybrid) combines strategic and tactical guidance from a single source with experience-based gating that adapts transfer based on TD-error volatility. LARM (Creus2024ARMFM) generates automaton structures via LLMs and transfers knowledge from a single source using embedding-based reward shaping. LANTERN is our full framework with multi-source aggregation and dual-volatility gating. For single-source baselines (AD, CADENT, LARM), we use the most semantically similar source to the target task (Rescue Mission for Dungeon Quest, Mining Operation for Blind Craftsman) to provide the strongest possible comparison.

### 4.2 Main Results

#### Dungeon Quest\.

LANTERN achieves 38% higher final reward than No Transfer. Compared to single-source methods, it improves by 42% over LARM and 15% over CADENT in early learning (episodes 0-500). Multi-source analysis shows dynamic weighting: “gather\_key” assigns 62% to Rescue Mission and 38% to Treasure Hunt, while “fully\_equipped” reverses to 45%/55% (Figure [1](https://arxiv.org/html/2605.05478#S4.F1), left).

#### Blind Craftsman\.

Despite cross-domain semantics (wood/product vs. ore/ingot vs. seed/crop), LANTERN achieves 32% higher reward than No Transfer. Semantic embeddings align “craft\_product” to both “smelt\_ore” (sim=0.87, weight=0.58) and “harvest\_crops” (sim=0.79, weight=0.42). LANTERN also handles structural mismatch: it weights Mining at 72% in multi-cycle states and equalizes to 51%/49% in delivery (Figure [1](https://arxiv.org/html/2605.05478#S4.F1), right).
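As an illustration of how similarity scores might be turned into source weights, the sketch below applies a temperature-scaled softmax and then mixes teacher policies with the resulting weights. This is an assumption, not the paper's aggregation rule, which is defined in an earlier section: the function names are hypothetical, and the temperature of 0.25 was chosen here only because it roughly matches the reported 0.58/0.42 split for similarities 0.87 and 0.79.

```python
import math
from typing import List, Sequence


def similarity_weights(similarities: Sequence[float], temperature: float = 0.25) -> List[float]:
    """Softmax over similarity scores; `temperature` is a hypothetical parameter."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_policies(policies: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Convex combination of per-source action distributions."""
    n_actions = len(policies[0])
    return [sum(w * p[a] for w, p in zip(weights, policies)) for a in range(n_actions)]


# Similarities of "craft_product" to "smelt_ore" and "harvest_crops" from the text.
weights = similarity_weights([0.87, 0.79])
# Made-up teacher action distributions, for illustration only.
mining_policy = [0.7, 0.2, 0.1]
farming_policy = [0.3, 0.4, 0.3]
print(weights, aggregate_policies([mining_policy, farming_policy], weights))
```

Lower temperatures concentrate the weight on the most similar source, while higher temperatures approach uniform averaging, so the temperature controls how strongly a poorly aligned source is discounted in this sketch.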

![Refer to caption](https://arxiv.org/html/2605.05478v1/dq_re_ln.png)![Refer to caption](https://arxiv.org/html/2605.05478v1/dq_se_ln.png)![Refer to caption](https://arxiv.org/html/2605.05478v1/dq_rs_ln.png)

![Refer to caption](https://arxiv.org/html/2605.05478v1/bc_re_ln.png)![Refer to caption](https://arxiv.org/html/2605.05478v1/bc_se_ln.png)![Refer to caption](https://arxiv.org/html/2605.05478v1/bc_rs_ln.png)

Figure 1: Main Results. (Left) Dungeon Quest. (Right) Blind Craftsman.

### 4.3 Ablation Studies

We ablate on Blind Craftsman, comparing: LANTERN (Full), No Semantic Gating (experience-only), Single Source (Mining only, dual-volatility), and Strategic Only (no policy distillation).

Key findings: Multi-source vs. single-source: 26% improvement; aggregating partial knowledge from multiple sources outperforms a single structurally similar source. Dual-volatility vs. experience-only: 18% improvement; semantic gating prevents negative transfer in poorly aligned regions. Strategic+tactical vs. strategic-only: 31% improvement; the combination provides both coarse task decomposition and fine action guidance (Figure [2](https://arxiv.org/html/2605.05478#S4.F2)).

![Refer to caption](https://arxiv.org/html/2605.05478v1/ab_re_ln.png)

![Refer to caption](https://arxiv.org/html/2605.05478v1/ab_se_ln.png)

![Refer to caption](https://arxiv.org/html/2605.05478v1/ab_rs_ln.png)

Figure 2: Ablation study. All components contribute synergistically: multi-source aggregation (26%), dual-volatility gating (18%), strategic+tactical guidance (31%).

### 4.4 Discussion

Across both domains, multi\-source aggregation consistently outperforms single\-source transfer \(23–42%\) by combining complementary knowledge from semantically diverse tasks with different goals\. Rather than relying on a single structurally similar source, LANTERN selectively integrates partial progressions that align at different stages of the task\.

Dual\-volatility gating further stabilizes learning by attenuating teacher influence in poorly aligned regions while preserving guidance when both semantic alignment and learning instability are present\. This adaptive behavior prevents negative transfer without sacrificing sample efficiency\.

Overall, LANTERN achieves 35–58% improvements in sample efficiency across distinct task structures\. The results suggest that semantic alignment at the automaton level provides an effective mechanism for transferring structured knowledge across domains with differing resources, layouts, and task semantics\.

## 5 Conclusion

We presented LANTERN, a multi\-source neurosymbolic transfer framework that addresses three practical bottlenecks: manual automaton specification, reliance on a single source, and static knowledge integration\. LANTERN combines LLM\-generated automata, semantic multi\-source aggregation, and dual\-volatility gating to enable adaptive transfer across heterogeneous tasks\. By aligning automaton states in a shared embedding space and regulating teacher influence through experience and semantic uncertainty, the framework integrates both strategic and tactical guidance within a unified learning update\.

Empirically, LANTERN achieves 40–60% improvements in sample efficiency while remaining robust to poorly aligned sources\. Ablation studies confirm that multi\-source aggregation, semantic gating, and multi\-level guidance each contribute substantially to performance\.

Limitations and future work. Current limitations include reliance on LLM quality for DFA generation, tabular learning scalability, and action-space alignment requirements. Future directions include automaton refinement with feedback, deep function approximation for continuous domains, learned action mappings, and continual multi-source transfer.

#### Acknowledgments.

This work was supported by DARPA under Agreement No\. HR0011\-24\-9\-0427 and NSF under Award CCF\-2106339\.

## References