SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
Summary
SOLAR proposes a self-optimizing autonomous agent that leverages parameter-level meta-learning and multi-level reinforcement learning to enable lifelong adaptation of LLMs to non-stationary data streams, outperforming baselines on reasoning tasks.
# SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
Source: [https://arxiv.org/html/2605.20189](https://arxiv.org/html/2605.20189)
Copyright for this paper by its authors\. Use permitted under Creative Commons License Attribution 4\.0 International \(CC BY 4\.0\)\.
1st Streaming Continual Learning Bridge at AAAI26, January 21, 2026, Singapore\.
\[orcid=0009\-0003\-6542\-324X, email=nitinvetcha@iisc\.ac\.in, url=https://github\.com/nitinvetcha/, \]\\cormark\[1\]
\[orcid=0000\-0002\-3042\-9161, email=dianbo@nus\.edu\.sg, url=https://www\.asintelligence\.xyz/, \]
\[1\]Corresponding author\.
Nitin VetchaDepartment of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, SingaporeDepartment of Computational and Data Sciences, Indian Institute of Science, Bangalore, Karnataka, India
\(2026\)
###### Abstract
Despite the remarkable success of large language models (LLMs), they still face bottlenecks when deployed in dynamic, real-world settings, the primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without causing catastrophic forgetting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR), an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It begins by consolidating a strong prior over common-sense knowledge, which makes it effective for transfer learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.
###### keywords:
Continual Adaptation, Lifelong Learning, Self-Evolution, Test-Time Adaptation, Transfer-Learning, Large Language Models
## 1 Introduction
Large Language Models (LLMs) possess remarkable emergent abilities due to massive pretraining. However, deploying them in streaming environments reveals a critical weakness: the inability to adapt to non-stationary data distributions (concept drift) without expensive retraining or human intervention. While Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA (Hu et al. 2022) reduce the parameter update volume, they remain static solutions that do not inherently address the stability-plasticity dilemma central to Continual Learning (CL). Existing adaptation strategies often rely on generic, hand-crafted heuristics that fail to generalize across the shifting temporal dependencies of real-world streams. This disconnect necessitates a system that can not only adapt parameters on the fly but also learn how to adapt based on accumulating experience. We propose that the high-dimensional weight space of an LLM contains rich meta-knowledge that, if navigated autonomously, can yield bespoke adaptation strategies for novel tasks. This motivates our primary research question:
RQ: Can LLMs learn to modify their internal representation space autonomously to handle concept drift, analogous to how humans assimilate and restructure knowledge in lifelong learning scenarios?
To answer this, we draw on the cognitive science of lifelong learning. As humans, we do not merely memorize new data; instead, we restructure our internal schematics to accommodate new information while retaining prior heuristics. This process is essentially what enables humans to navigate non-stationary environments. For instance, a student adapts their study strategy based on the nature of a new subject (plasticity) without unlearning how to study generally (stability). Current LLM adaptation, by contrast, is often rigid, since models consume task data “as-is” and fail to develop bespoke internal transformation strategies. To replicate this cognitive flexibility, we introduce SOLAR (Self-Optimizing Lifelong Autonomous Reasoner). It functions as a meta-learning agent that decouples rapid task adaptation (streaming machine learning) from long-term strategy retention (continual learning). By discovering and validating parameter-level modifications, SOLAR enables efficient adaptation to unseen tasks while populating a persistent knowledge base to mitigate catastrophic forgetting. This work thus bridges the gap between static parameter generation and dynamic, lifelong self-evolution. Furthermore, by grounding the search space in neural network weights, we target generalized principles of model capability rather than task-specific memorization. Just as scaling laws (Kaplan et al. 2020) predict performance based on size, we posit that predictable weight-modification patterns exist that allow for rapid, data-efficient adaptation to concept drift, minimizing the lag between detecting a distributional shift and deploying an updated model. The remainder of this paper is organized as follows. In Section 2, we describe the motivation for our approach in detail. Section 3 surveys related work, Section 4 presents the methodology, and Section 5 gives implementation specifics. Experimental results are provided in Section 6, and Section 7 presents our concluding remarks.
## 2 Motivation
Our primary motivation stems from human psychology and pedagogy. Consider, for example, a student preparing for an end-of-semester examination in a machine learning course. Students often rely on notes they prepared earlier, derived from lecture content, textbooks or information available on the internet. Thus, instead of relying on the raw content, students assimilate and rewrite the information as notes according to their own reasoning skill and aptitude. This improves their ability to comprehend the content and therefore to respond well to exam questions. This phenomenon of reinterpreting and augmenting external knowledge in a way that is easier to understand, while developing the necessary skill-sets, is not limited to taking exams but seems to be universally true of human learning across tasks. Furthermore, depending on their interests, humans assimilate information in different ways: some might condense the information into a visual diagram, some into text, and some might rely more on concrete mathematical descriptions. Such restructuring of internal knowledge and rewriting of external information as part of the learning process contrasts with how LLMs currently undergo training and adaptation. Given a new task, current LLMs consume and learn from the task data "as-is" via finetuning or in-context learning. The issue is that, just as in the human setting, such data may not be in an optimal format (or volume) for learning, or the relevant skill-set for learning it may not yet have been developed, and current approaches do not enable models to develop bespoke strategies for how best to transform themselves internally or even to learn from their training data. In this work, we therefore investigate whether LLMs, analogous to humans, can themselves suggest strategies that enable them to perform better on a given task.
A secondary motivation, which explains why we ground our strategy search space in the neural network weights, is that, unlike task-specific knowledge, weight-level meta-knowledge represents generalized principles about how neural network parameters relate to model capabilities, thereby providing crucial insights for self-evolving agents. Several prior works have shown a positive correlation between types of neural network weight patterns and downstream model performance characteristics. For example, scaling laws research [kaplan2020scaling] has demonstrated predictable relationships between model size and performance. Similarly, structured sparsity learning indicates how particular weight patterns can be useful for developing more efficient representations [wen2016learning].
## 3 Related Work
Test-Time Training (TTT) is a recently emerging class of approaches that update model weights at inference time. Such updates may use input perplexity or cross-entropy minimization on unlabeled test data alone, enabling self-supervised enhancement of LLM performance [hu2025testtimelearninglargelanguage,hu2025slotsamplespecificlanguagemodel]; reinforcement learning that exploits the priors in pre-trained models [zuo2025ttrltesttimereinforcementlearning]; reflection and verifier-driven sample selection [moradi2025continuousselfimprovementlargelanguage,lee2025reviselearningrefinetesttime]; a task-specific curriculum [hübotter2025learningjobtesttimecurricula]; or mixture-of-experts-based model merging [bertolissi2025localmixturesexpertsessentially]. An alternative approach is to scale inference compute at test time, for example with ensemble approaches such as majority voting. While test-time approaches are a promising option, their computational overhead is not always necessary, and they often fail when data is scarce or the quality of unlabeled data is poor.
Adversarial Fine Tuning is another emerging class of techniques in which two LLM instances either debate a topic with each other, or one instance serves as a challenger (teacher) and the other as a solver (student), to generate synthetic data, either from unlabeled prompts or from scratch. Approaches like majority voting are then used to create pseudo-labels, which can in turn be used to update the model’s knowledge [yang2024syntheticcontinuedpretraining,wang2025selfupdatablelargelanguagemodels,wang2025lokilowdamageknowledgeimplanting]. This can also be done through additional fine-tuning on information available in the LLM’s context [park2025textitnewnewssystem2finetuning], similar to knowledge distillation. Recent works include SQLM [chen2025self], R-Zero [huang2025rzeroselfevolvingreasoningllm], TT-SI [acikgoz2025selfimprovingllmagentstesttime] and SIRLC [pang2023languagemodelselfimprovementreinforcement]. While this approach is effective in data-scarce domains where TTT fails, it is not always sufficient, as certain challenging domains require mastering novel reasoning skills, and it is well known that scaling data alone is not enough in regimes such as mathematics [hendrycks2021measuringmathematicalproblemsolving].
Reinforcement Learning (RL) is a well-established approach for pushing the capabilities of LLMs, and recent works such as SEAL [zweiger2025selfadaptinglanguagemodels], RLAIF [li2025curriculumrlaifcurriculumalignmentreinforcement], SRLM [yuan2025selfrewardinglanguagemodels] and Memento [zhou2025mementofinetuningllmagents], which uses a memory-based online RL policy, have shown promising potential for low-cost continual adaptation of LLMs. In RL, meta-learning has also been used to train agents in scenarios where they need to learn novel tasks quickly [gupta2018metareinforcementlearningstructuredexploration]. SOLAR can thus be seen as following meta-learning principles, since it learns an adaptation strategy, i.e., how to generate effective self weight updates using a meta-optimization loop. Closely related are self-referential systems, which learn to update their own parameters as in [irie2022modernselfreferentialweightmatrix], and self-evolving agents, which enable an LLM to improve by autonomously acquiring, refining and learning from experiences generated by the model itself [tao2024surveyselfevolutionlargelanguage,gao2025surveyselfevolvingagentspath]. While RL-based approaches are effective, it is often challenging to achieve convergence and to design optimal policies that are efficient in terms of compute and time.
Parameter Generation is another research direction which has seen several pioneering works such as RPG [wang2025recurrent], DnD [liang2025drag], T2L [charakorn2025texttolorainstanttransformeradaption], ORAL [khan2025oralpromptinglargescaleloras] and COND P-DIFF [jin2024conditionalloraparametergeneration]. DnD generates task-specific parameters from unlabeled prompts without per-task training via a prompt-conditioned hyper-convolutional decoder, while T2L does the same but uses a hyper-network and a task description instead. ORAL leverages architectural and textual conditioning for flexible, scalable LoRA parameter adaptation. RPG introduces a recurrent diffusion architecture for scalable unconditional LoRA parameter generation. COND P-DIFF applies conditional latent diffusion for controllable LoRA parameter synthesis with strong cross-domain generalization. An associated direction is model merging, which facilitates generalization to unseen tasks via multi-task learning [shao2025icmfusionincontextmetaoptimizedlora,shao2025incontextmetalorageneration]. While these works have been effective, their limitation is that the generated parameters are static: once generated, they undergo no further modification, yet such modification is crucial for domains requiring implicit meta-knowledge.
## 4 Methodology
In this section, we describe the framework of our proposed approach (see Figure [1](https://arxiv.org/html/2605.20189#S4.F1)). (Footnote 1: [zhang2025toward] also provides motivation for the development of self-improving agents at the weight level; however, it only describes a conceptual framework, with no implementation or empirical validation.) SOLAR starts by treating the LLM’s own weights as environment variables to explore, upon which it systematically proposes scientific hypotheses to modify the internal representation space so as to adapt the LLM to the unseen task. A major challenge for the design, therefore, is the high dimensionality and non-convexity of the LLM weight space, which makes initialization and subsequent exploration extremely complex. To overcome this, we work only with low-rank parameters [hulora], which constitute a much smaller fraction ($\sim 1\%$) of the original model’s weights. In addition, to avoid the limitations of selecting a single starting point, which might not be an optimal point to explore around, we prefer to sample from a plausible weight distribution. This step is essential to eliminate the risk of non-convergence. To obtain this initial distribution over weights, i.e., self-weight sampling, we refer to prior works in large-scale LLM parameter generation and use a convolution-based decoder architecture as the backbone for SOLAR’s exploration point initializer.
Once the weights have been initialized for exploration (Footnote 2: These weights can optionally be encoded into a structured representation correlated with network performance, as in world models such as JEPA [lecun2022path].), SOLAR uses a foundation-model-based agent, which is for now simply an LLM trained using reinforcement learning (RL), to come up with probable hypotheses at inference time for weight-space exploration using test-time scaling and compute. To facilitate the training process, however, it is necessary to first curate by hand a seed knowledge base consisting of either proven or plausible weight modification strategies, which then serves as the action space for the LLM’s initial stages of exploration during RL training. Training follows a multi-stage recipe with three progressively harder levels. Level I consists of training the LLM to produce only single valid and efficient self-edits (a self-edit, as the name suggests, is a modification strategy proposed by an LLM to update its own weights depending on the task) from among those present in the initial knowledge base. Level II comprises training the model to output chains of self-edits, since coupling strategies sequentially is also helpful (moreover, viewed abstractly, a chain is in effect a single complex edit that can be decomposed into simpler instances). Level III is significantly more challenging, both for the LLM and from an implementation perspective: it lets the LLM explore the hypothesis space in its entirety, thereby going beyond human-crafted approaches. A positive result in Level III would be a significant leap, as it could open up new frontiers in training and fine-tuning paradigms, as has similarly been done in other areas such as neural architecture search [liu2025alphagomomentmodelarchitecture] and optimization [lu2024discoveringpreferenceoptimizationalgorithms].
Figure 1: SOLAR’s methodology of weight-level meta-knowledge discovery and modification summarized (adapted from [zhang2025toward])
After plausible hypotheses have been generated by the foundation-model-based agent and implemented, it is necessary to test them. For this purpose, we use a separate evaluation split if one is available. However, since SOLAR is designed to adapt LLMs efficiently to unseen tasks as well, the evaluation dataset itself can be generated on the fly using adversarial approaches involving multiple instances of an LLM, one proposing and one solving questions on a particular topic, as in SQLM [chen2025self] or R-Zero [huang2025r]. Once a hypothesis has been tested and found to be valid (i.e., it improves performance on some pre-determined metric such as accuracy on the `eval` set), it is added back into the knowledge base, thereby enriching the LLM’s action space for future iterations. To prevent catastrophic forgetting, SOLAR also implements a meta-level weight regularization technique. Therefore, by automating the process of self-improvement using principled methodologies and meta-knowledge in a scientific manner (i.e., propose, validate and accept hypotheses), SOLAR provides a holistic framework towards the next generation of AI-generating-AI agents, because as soon as web-scale data corpora are exhausted, progress will hinge on a model’s capacity to generate its own high-utility training signal.
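The propose, validate and accept cycle described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `validate_and_accept`, the dictionary representation of a strategy, and the toy scores are hypothetical and are not taken from the SOLAR implementation. The sketch only shows the acceptance rule stated in the text, namely that a strategy joins the knowledge base if it improves a pre-determined metric on the evaluation set.

```python
from typing import Callable, Dict, List

Strategy = Dict[str, object]  # hypothetical representation of one self-edit


def validate_and_accept(
    knowledge_base: List[Strategy],
    candidates: List[Strategy],
    evaluate: Callable[[Strategy], float],
    baseline_score: float,
) -> List[Strategy]:
    """Add each candidate that beats the baseline metric to the knowledge base."""
    accepted = []
    for strategy in candidates:
        score = evaluate(strategy)  # e.g. accuracy on the eval set after applying it
        if score > baseline_score and strategy not in knowledge_base:
            knowledge_base.append(strategy)  # enriches the future action space
            accepted.append(strategy)
    return accepted


# Toy usage with made-up scores (assumption: accuracy in [0, 1]).
kb = [{"family": "TTT", "ttl_steps": 5}]
candidates = [
    {"family": "TTT", "ttl_steps": 10},
    {"family": "LoRA", "lambda": 0.3},
]
toy_scores = {"TTT": 0.71, "LoRA": 0.58}
accepted = validate_and_accept(
    kb, candidates, lambda s: toy_scores[s["family"]], baseline_score=0.65
)
print(len(kb), accepted)
```

In this toy run only the first candidate clears the baseline, so the knowledge base grows by one entry.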
## 5 Implementation
### 5.1 Architecture
The primary architectural detail in SOLAR’s framework is the design of the weight-space exploration initializer. As mentioned in Section [4](https://arxiv.org/html/2605.20189#S4), we use a convolution-based decoder model for this purpose. We assume that we have access to either the unseen task’s description or at least a handful of unlabeled examples representative of its requirements. We then send them to an open-source text encoder for embedding extraction. This extraction process can be formally represented as $c_i = \mathrm{Encoder}(p_i, \theta)$, where $\mathrm{Encoder}(\cdot,\cdot)$ denotes the embedding extraction function parameterized by $\theta$, and $c_i$ represents the extracted embedding corresponding to prompt $p_i$. We use an encoder-based language model architecture for this purpose, namely Sentence-BERT (specifically all-MiniLM-L6-v2) [reimers2019sentence]. (Footnote 3: It is to be noted that BERT’s supported sequence length is only 512 and for longer sequences, padding should be done. However, in our use case, the maximum sequence length is only 384 and thus padding is not necessary.)
Next, following [wang2025recurrent], is the parameter tokenization process (see Figure [2](https://arxiv.org/html/2605.20189#S5.F2)), which is designed to preserve both the layer-wise distribution and the cross-layer correlations. Specifically, (i) weights are split according to their layer indices, (ii) layer-wise normalization is applied to mitigate distribution shifts, (iii) parameters are sliced into non-overlapping tokens of uniform size, and (iv) a lightweight permutation state (encoded as a one-hot vector) is used to alleviate symmetry issues [kunin2020neural] when collecting multiple checkpoints. Additionally, 2D position embeddings [dosovitskiy2020image] (the first dimension encodes the layer index, while the second captures the token’s in-layer position) are employed to ensure the network retains positional awareness of each token within the entire set. In our case, each LoRA matrix is of shape $8 \times 896$, which is split into 7 smaller chunks, each of shape $8 \times 128$, and each chunk is finally padded to a uniform size of $10 \times 130$.
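The slicing and padding arithmetic in the last sentence can be checked with a short sketch. The shapes ($8 \times 896$, seven $8 \times 128$ chunks, padding to $10 \times 130$) come from the text; the helper names, the use of nested Python lists, and the choice of zero padding on the bottom and right are assumptions made for illustration, since the text does not say where the padding is placed or what value is used.

```python
from typing import List

Matrix = List[List[float]]


def split_into_tokens(matrix: Matrix, token_width: int) -> List[Matrix]:
    """Slice a matrix column-wise into non-overlapping tokens of equal width."""
    n_cols = len(matrix[0])
    if n_cols % token_width != 0:
        raise ValueError("column count must be a multiple of token_width")
    return [
        [row[start:start + token_width] for row in matrix]
        for start in range(0, n_cols, token_width)
    ]


def pad_token(token: Matrix, height: int, width: int, fill: float = 0.0) -> Matrix:
    """Pad a token on the right and bottom (assumed placement) to height x width."""
    padded = [row + [fill] * (width - len(row)) for row in token]
    padded += [[fill] * width for _ in range(height - len(token))]
    return padded


# A dummy 8 x 896 LoRA matrix whose entries encode their own position.
lora_matrix = [[float(r * 896 + c) for c in range(896)] for r in range(8)]
tokens = [pad_token(t, 10, 130) for t in split_into_tokens(lora_matrix, 128)]
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 7 10 130
```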
Figure 2: Details of the Parameter Tokenization Process
Say the dimension of the prompt embeddings is $[B, N, L, C]$, where $B$, $N$, $L$ and $C$ denote batch size, length of prompt batch (i.e., number of prompts), sequence length, and hidden dimension, respectively. The decoder (see Figure [3](https://arxiv.org/html/2605.20189#S5.F3)) consists of multiple sequential layers, each performing five 2D convolutions. These convolutions are divided into three categories: i) width convolution, which operates on the $(C, L)$ dimension, ii) height convolution, which operates on the $(L, N)$ dimension, and iii) layer-wise convolution, which operates on the $(N, L)$ dimension, denoted $\text{Conv}_W$, $\text{Conv}_H$ and $\text{Conv}_L$, respectively. Each layer consists of two $\text{Conv}_W$, two $\text{Conv}_H$ and one $\text{Conv}_L$. Given this, the forward operation of the decoder block is
$$
\begin{aligned}
c^{l}_{W} &= \text{Conv}^{1}_{H}\left(\text{Conv}^{1}_{W}(c^{l-1})\right) \\
c^{l}_{H} &= \text{Conv}^{2}_{W}\left(\text{Conv}^{2}_{H}(c^{l-1})\right) \\
c^{l} &= \text{Conv}_{L}\left((c^{l}_{W} + c^{l}_{H} + b)/3\right)
\end{aligned}
$$
where $c^{l}$ is the hidden state output by the $l$-th layer, $c^{0}$ is the prompt embedding encoded by the condition extractor, and $b$ is a learnable bias. Through this process, the input is transformed from dimension $[B, N, L, C]$ to $[B, N', L', C']$, which is then compatible for conversion into a flattened LoRA adapter for the LLM. (Footnote 4: In our present implementation, the entire flow is $(128,384,384) \to (128,200,300) \to (128,100,256) \to (256,50,200) \to (512,50,200) \to (1024,25,200) \to (1024,10,200) \to (2048,10,200) \to (4296,8,128)$.) In this work, the base LLM used is Qwen2.5-0.5B-Instruct [qwen2025qwen25technicalreport], and LoRA is applied to the linear projection layers within both the self-attention mechanism and the MLP blocks of the transformer architecture. Specifically, this includes the query, key, value and output projections in attention blocks, as well as the gate, up and down projections in MLP blocks.
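As a sanity check on the final stage $(4296, 8, 128)$ of the flow, the number of $8 \times 128$ tokens needed for a rank-8 LoRA adapter on the seven listed projections can be counted directly. The rank and token width come from the text. The layer dimensions below (hidden size 896, MLP intermediate size 4864, key/value projection width 128, 24 layers) are assumptions taken from the publicly released Qwen2.5-0.5B configuration rather than from this paper, and the sketch further assumes that each LoRA factor (A of shape rank × input and B of shape output × rank) is sliced into rank × 128 tokens.

```python
RANK = 8           # LoRA rank implied by the 8 x 896 matrices in the text
TOKEN_WIDTH = 128  # token width from the text

# Assumed Qwen2.5-0.5B dimensions (from the public model config, not the paper).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 896, 4864, 128, 24

# (input_dim, output_dim) of every projection that receives a LoRA adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def tokens_for_projection(in_dim: int, out_dim: int) -> int:
    """Tokens for LoRA A (rank x in_dim) plus LoRA B (out_dim x rank)."""
    return in_dim // TOKEN_WIDTH + out_dim // TOKEN_WIDTH


tokens_per_layer = sum(tokens_for_projection(i, o) for i, o in PROJECTIONS.values())
total_tokens = tokens_per_layer * NUM_LAYERS
total_lora_params = total_tokens * RANK * TOKEN_WIDTH
print(tokens_per_layer, total_tokens, total_lora_params)  # 179 4296 4399104
```

Under these assumptions the count reproduces the 4296 tokens in the final stage of the flow, and the roughly 4.4M adapter parameters are on the order of 1% of a model with about half a billion parameters.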
Figure 3: Details of the Hyper-Convolutional Decoder Architecture used
### 5.2 Training
In this work, we focus on the domain of common-sense reasoning and select four datasets for evaluation, namely HellaSwag [zellers2019hellaswagmachinereallyfinish], BoolQ [clark2019boolqexploringsurprisingdifficulty], and the challenge and easy sets of the AI2 Reasoning Challenge (ARC) [clark2018thinksolvedquestionanswering]. The ARC dataset contains grade-school-level, multiple-choice science questions. HellaSwag instructs models to select the choice that best finishes a sentence from among the ground truth and an adversarial set of machine-generated wrong answers. BoolQ is a question answering dataset of yes/no questions covering various factual problems. We use existing checkpoints of these datasets (batch size was 32 and number of samples was 5000), which were collected by first pretraining on the target dataset for 75 steps with a learning rate of 1e-4 and then fine-tuning on the target dataset for 50 additional steps with a learning rate of 1e-5, while saving a checkpoint at each step. (Footnote 5: For training, however, Open-Book Question Answering (OBQA) [mihaylov2018can], Physical Interaction: Question Answering (PIQA) [bisk2019piqareasoningphysicalcommonsense] and WinoGrande [sakaguchi2019winograndeadversarialwinogradschema] have been used as well. OBQA aims to promote research in advanced question-answering with salient facts summarized as an open book. PIQA focuses on everyday situations with a preference for atypical solutions. WinoGrande features a fill-in-a-blank task with binary options for commonsense reasoning questions.)
Subsequently, prompt-checkpoint pairing is done as follows. A given dataset $P$ is first divided into non-overlapping prompt batches $[p_1, \cdots, p_i, \cdots, p_I]$. Denote the trained LLM checkpoints of this dataset as $M = [m_1, \cdots, m_j, \cdots, m_J]$. A batch of prompts and a corresponding checkpoint are then picked at random to create a pair $\{p_i, m_j\}$, which serves as an input-output data point for training the decoder. The objective function for training is the mean squared error (MSE) loss between the output of the decoder’s last block for a particular prompt batch and the training checkpoint associated with it.
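A minimal sketch of the pairing and the loss follows. It is illustrative only: the function names are hypothetical, checkpoints are represented as flat lists of floats, dropping an incomplete final batch is an assumption (the text does not say how a remainder is handled), and uniform random sampling is assumed for picking the pair.

```python
import random
from typing import List, Sequence, Tuple


def make_prompt_batches(prompts: Sequence[str], batch_len: int) -> List[List[str]]:
    """Divide prompts into non-overlapping batches (incomplete tail dropped: assumption)."""
    n_full = len(prompts) // batch_len
    return [list(prompts[k * batch_len:(k + 1) * batch_len]) for k in range(n_full)]


def sample_pair(
    batches: List[List[str]],
    checkpoints: List[List[float]],
    rng: random.Random,
) -> Tuple[List[str], List[float]]:
    """Pick one prompt batch p_i and one checkpoint m_j at random."""
    return rng.choice(batches), rng.choice(checkpoints)


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between decoder output and flattened checkpoint."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


prompts = [f"prompt {k}" for k in range(10)]
batches = make_prompt_batches(prompts, batch_len=4)
checkpoints = [[0.1, 0.2, 0.3], [0.0, 0.2, 0.5]]  # toy flattened checkpoints
p_i, m_j = sample_pair(batches, checkpoints, random.Random(0))
loss = mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(len(batches), len(p_i), loss)
```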
The next crucial step is the hand-crafting of the seed knowledge base. To this end, we identify five primary families of strategies, each containing its own sub-strategies. (Footnote 6: Unfortunately, there are no research works highlighting approaches for optimizing the performance of LoRAs obtained via parameter generation, which poses a major challenge in identifying plausible strategies; these had to be cherry-picked via trial and error.) The families are:
- Test-Time Training (TTT) using input perplexity minimization [hu2025testtimelearninglargelanguage] or via reinforcement learning [zuo2025ttrltesttimereinforcementlearning], for example by using self-reflection and verification loops like GEPA [agrawal2025gepareflectivepromptevolution], ReflectEvo [li2025reflectevoimprovingmetaintrospection], REVISE [lee2025reviselearningrefinetesttime] or Instruct-of-Reflection [liu2025instructofreflectionenhancinglargelanguage]. It could also involve prompt optimization using frameworks like TextGrad [yuksekgonul2024textgradautomaticdifferentiationtext] or CAST [tang2025enhancingcrosstasktransferlarge].
- Post-training data-free LoRA modifications such as mixing LoRA subspaces obtained by weight decomposition of constituent matrices [wu2025mixtureofsubspaceslowrankadaptation], bounding the norm of selected parameters [wang2025normboundedlowrankadaptation], or even merging multiple task-specific LoRA adapters [zhao2024mergingloraslikeplaying].
- Reinforcement-learning-based frameworks like SQLM [chen2025selfquestioninglanguagemodels], R-Zero [huang2025rzeroselfevolvingreasoningllm] or SEAL [zweiger2025selfadaptinglanguagemodels], which enable LLMs to self-adapt by generating their own finetuning data and update directives (another example is TT-SI [acikgoz2025selfimprovingllmagentstesttime]).
- Test-Time Scaling (TTS) using either a router or an ensemble approach, i.e., we generate and perform inference with multiple adapters obtained from different representative prompt batches and, to obtain the final prediction, select either the most confident prediction (`max_confidence`), the majority vote, or `sum_logprobs` (i.e., sum log probabilities across adapters per prediction and pick the one with the highest total logprob).
- Latent Space (LS) approaches, which work on or modify the internal layers [hu2025slotsamplespecificlanguagemodel] or hidden activations [zhang2025latentevolveselfevolvingtesttimescaling] of the LLM directly. These may also involve decoding algorithms that modify the sampling procedure itself [karan2025reasoningsamplingbasemodel,wang2025endmanualdecodingtruly]. We consider them part of the latent space family because they alter the internal probability distribution over next tokens, unlike other families, which modify the parameters explicitly.
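The three ensemble rules named in the Test-Time Scaling family can be sketched as follows. This is a minimal illustration, not the SOLAR code: the function name `aggregate` and the input format (one dictionary of answer-choice log-probabilities per adapter) are assumptions, and "most confident" is interpreted here as the single highest per-choice probability across adapters.

```python
import math
from collections import Counter
from typing import Dict, List


def aggregate(per_adapter_logprobs: List[Dict[str, float]], method: str) -> str:
    """Combine per-adapter log-probabilities over answer choices into one answer."""
    if method == "max_confidence":
        # Take the answer of the adapter whose top choice has the highest log-prob.
        best = max(per_adapter_logprobs, key=lambda lp: max(lp.values()))
        return max(best, key=best.get)
    if method == "majority_vote":
        # Each adapter votes for its own top choice.
        votes = Counter(max(lp, key=lp.get) for lp in per_adapter_logprobs)
        return votes.most_common(1)[0][0]
    if method == "sum_logprobs":
        # Sum log-probabilities per choice across adapters, pick the largest total.
        totals: Dict[str, float] = {}
        for lp in per_adapter_logprobs:
            for choice, value in lp.items():
                totals[choice] = totals.get(choice, 0.0) + value
        return max(totals, key=totals.get)
    raise ValueError(f"unknown method: {method}")


# Toy example: one confident adapter prefers A, two lukewarm adapters prefer B.
adapters = [
    {"A": math.log(0.90), "B": math.log(0.10)},
    {"A": math.log(0.40), "B": math.log(0.60)},
    {"A": math.log(0.45), "B": math.log(0.55)},
]
for name in ("max_confidence", "majority_vote", "sum_logprobs"):
    print(name, aggregate(adapters, name))
```

In this toy case the rules disagree: `max_confidence` and `sum_logprobs` return A, while `majority_vote` returns B, which illustrates why the choice of rule is itself a tunable part of the strategy.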
We first formulate the objective for outer-loop RL training, which generates adaptation strategies $\texttt{AS}$, as in [zweiger2025selfadaptinglanguagemodels]. Let $\theta$ denote the parameters of the language model $\texttt{LM}_{\theta}$. In order to adapt to an unseen dataset (task) $\mathcal{D}$, SOLAR requires, as specified in Section [4](https://arxiv.org/html/2605.20189#S4), $C$, a context containing information relevant to the task, and $\tau$, the evaluation strategy and metric used to assess the model’s downstream adaptation. Based on $C$, SOLAR generates an $\texttt{AS}$ and updates its parameters accordingly, $\theta' \leftarrow \texttt{Update}(\theta, \texttt{AS})$. We thus have an RL setup, i.e., the model takes an action (generating $\texttt{AS}$), receives a reward $r$ based on $\texttt{LM}_{\theta'}$’s performance on $\tau$, and updates its policy to maximize expected reward,
$$
\mathcal{L}_{\text{RL}}(\theta_t) := -\mathbb{E}_{(C,\tau)\sim\mathcal{D}}\left[\mathbb{E}_{\texttt{AS}\sim\texttt{LM}_{\theta_t}(\cdot\mid C)}\left[r(\texttt{AS},\tau,\theta_t)\right]\right]
$$
Note that the reward assigned to a given action depends on the model parameters $\theta$ at the time the action is taken (since $\theta$ is updated to $\theta'$, which is then evaluated). An implication is that, when modeling the RL state, one must include the policy’s parameters $\theta$ along with $C$, even though the policy’s observation is limited to $C$ (because it is infeasible to place $\theta$ directly in the LLM’s context window). Therefore, (state, action, reward) triples collected using older model weights, $\theta_{\text{old}}$, will not be aligned with the current model $\theta_{\text{current}}$. Hence, an on-policy approach should be adopted, in which adaptation strategies are sampled from the current model and, even more importantly, the rewards themselves are computed using the current model.
In particular, the specific on-policy approach used is ReST$^{\text{EM}}$ [singh2024humandatascalingselftraining], where samples are first generated from the current model and are filtered using binary feedback [$r(\texttt{AS},\tau,\theta_t)$ is 1 if, on $\tau$, $\texttt{AS}$ improves $\texttt{LM}_{\theta_t}$’s performance, and is 0 otherwise]. (Footnote 7: Currently, only a fixed number of samples is generated, 15 to be precise. This could however be made dynamic in a future version of the work, wherein samples would continue to be generated until a particular confidence threshold, as determined by the model itself, is reached. The same is true for the number of iterations, which is just 2 for now.) The model is then fine-tuned on these samples, and this continues in an iterative manner (see Algorithm [1](https://arxiv.org/html/2605.20189#alg1)).
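The sample, filter and fine-tune cycle can be sketched generically. The defaults of 15 samples and 2 iterations come from the footnote; everything else is a stand-in chosen for illustration. In particular, the "policy" below is just a float giving the probability of proposing a helpful strategy, and `toy_sample`, `toy_reward` and `toy_finetune` are hypothetical placeholders for LLM sampling, evaluation on $\tau$, and supervised fine-tuning respectively.

```python
import random
from typing import Callable, List


def rest_em(
    policy: float,
    sample_strategy: Callable[[float], str],
    reward: Callable[[str, float], int],
    finetune: Callable[[float, List[str]], float],
    num_iterations: int = 2,
    num_samples: int = 15,
) -> float:
    """Filtered self-training loop: sample on-policy, keep reward-1 samples, fine-tune."""
    for _ in range(num_iterations):
        # Sample adaptation strategies from the *current* policy (on-policy).
        samples = [sample_strategy(policy) for _ in range(num_samples)]
        # Binary filter: rewards are computed against the current policy as well.
        kept = [s for s in samples if reward(s, policy) == 1]
        if kept:
            policy = finetune(policy, kept)
    return policy


rng = random.Random(0)


def toy_sample(policy: float) -> str:
    # Stand-in for LLM sampling: "helpful" with probability equal to the policy.
    return "helpful" if rng.random() < policy else "unhelpful"


def toy_reward(strategy: str, policy: float) -> int:
    # Stand-in for "does this strategy improve performance on tau?".
    return 1 if strategy == "helpful" else 0


def toy_finetune(policy: float, kept: List[str]) -> float:
    # Stand-in for fine-tuning: move halfway toward always proposing helpful edits.
    return policy + 0.5 * (1.0 - policy)


final_policy = rest_em(0.5, toy_sample, toy_reward, toy_finetune)
print(final_policy)
```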
A detail not yet covered is the exact nature of the adaptation strategy itself. This depends on the strategy family being used; however, the format is consistent across all families: a JSON object specifying the configuration to be used (see note 8). It contains a field, `family`, which takes the values `TTT`, `LoRA`, `TTS` and `LS`.

Note 8: Since the model used is Qwen2.5-0.5B-Instruct, it had difficulty following the prompt instructions for generating structured outputs, even after adjusting the temperature. In such cases, verification and formatting were done using Qwen2.5-7B-Instruct instead.

Currently, the following choices have been experimented with:
- For `TTT`, we use [hu2025testtimelearninglargelanguage], and the corresponding JSON object has the fields `ttl_steps` (number of training steps in the TTL loop), `learning_rate`, `batch_size` and `shuffle_data` (a boolean).
- For `LoRA` modifications, we use the two-subspace (TS) mixing version from [wu2025mixtureofsubspaceslowrankadaptation], and the corresponding JSON object has a single field, `lambda`, a hyperparameter determining the ratio in which the two resulting subspaces are mixed.
- For `TTS`, we use either an ensemble or a router approach. In the router approach (see Figure [4](https://arxiv.org/html/2605.20189#S5.F4)), we sample multiple prompt batches and choose the batch whose average of per-prompt similarity scores (M1) or averaged prompt embedding (M2) is closest to that of the question at test time (see note 9). The corresponding JSON object has the fields `num_prompt_batches` (the number of prompt batches to be sampled from the test split of the unseen dataset) and `method`, which takes one of five values: `avg_sim_score`, `avg_prompt_embed`, `max_confidence`, `majority_vote` or `sum_logprobs` (summing log probabilities). The first two belong to the router approach and the latter three constitute the ensemble approach.
- For `LS`, we use [hu2025slotsamplespecificlanguagemodel], and the corresponding JSON object has the fields `times` and `learning_rate`.

Figure 4: Router Approach for TTS

Note 9: Cosine similarity and Euclidean distance were tested, and the latter was found to perform better empirically. Thus, `avg_sim_score` and `avg_prompt_embed` use Euclidean distance by default. Alternatively, the similarity measure could be included as a new field, but this has not been explored in the current work.
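To make the two router variants for `TTS` concrete, here is a small sketch under stated assumptions: prompts and the test question are already embedded as plain lists of floats (the embedding model is not shown), Euclidean distance is the measure, and "closest" means smallest distance. The function name `route` and the helper names are illustrative and not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a batch of embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def route(question: Vector, batches: List[List[Vector]], method: str) -> int:
    """Return the index of the prompt batch closest to the question."""
    if method == "avg_sim_score":
        # M1: average the per-prompt distances to the question.
        scores = [
            sum(math.dist(prompt, question) for prompt in batch) / len(batch)
            for batch in batches
        ]
    elif method == "avg_prompt_embed":
        # M2: average the prompt embeddings first, then measure one distance.
        scores = [math.dist(mean_vector(batch), question) for batch in batches]
    else:
        raise ValueError(f"not a router method: {method!r}")
    return min(range(len(batches)), key=scores.__getitem__)


question = [0.0, 0.0]
batches = [
    [[-2.0, 0.0], [2.0, 0.0]],  # prompts far away, but centered on the question
    [[1.0, 0.0], [1.0, 0.0]],   # prompts individually closer to the question
]
m1_choice = route(question, batches, "avg_sim_score")
m2_choice = route(question, batches, "avg_prompt_embed")
```

In this toy input the two methods disagree: M1 prefers the second batch because each prompt is individually closer, while M2 prefers the first because its averaged embedding coincides with the question.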
**Algorithm 1** Sequential Multi-Level RL Loop for Adaptation Strategy (AS) Generation of SOLAR

1. **Input:** Base LM $\theta$, dataset context $C$, evaluation metric $\tau$, initial knowledge base $K$
2. **Init:** Low-rank adapter generator $G$, sampled adapters $S \leftarrow \texttt{Sampler}(C, G)$
3. **Level I (Single-edit self-training):**
4. **for** iteration $t = 1, \dots, T_1$ **do**
5. Propose single-edit AS from $K$: $\texttt{AS} \sim \text{LM}_{\theta}(K, C)$
6. Apply AS and obtain weights: $\theta' \leftarrow \texttt{ApplyStrategy}(\theta, \texttt{AS}, S)$
7. Evaluate: $\texttt{Ans} \sim \text{LM}_{\theta'}(\cdot \mid \tau)$
8. Compute reward: $r \leftarrow r(\texttt{Ans}, \tau)$
9. **if** $r > \text{threshold}_1$ **then**
10. $\theta \leftarrow \texttt{RL\_Update}(\theta, r, \texttt{AS})$
11. **end if**
12. **end for**
13. **Level II (Chained/compositional strategies):**
14. **for** iteration $t = 1, \dots, T_2$ **do**
15. Propose chain of edits: $\texttt{AS} = [e_1, \dots, e_k],\; e_i \in K$
16. Sequentially apply chain: $\theta_0 \leftarrow \theta$; $\theta_i \leftarrow \texttt{ApplyStrategy}(\theta_{i-1}, e_i, S)$
17. Evaluate final weights: $\texttt{Ans} \sim \text{LM}_{\theta_k}(\cdot \mid \tau)$
18. Compute reward: $r \leftarrow r(\texttt{Ans}, \tau)$
19. **if** $r > \text{threshold}_2$ **then**
20. Add chain to KB: $K \leftarrow K \cup \{\texttt{AS}\}$
21. $\theta \leftarrow \texttt{RL\_Update}(\theta, r, \texttt{AS})$
22. **end if**
23. **end for**
24. **Level III (Open-ended exploration):**
25. **for** iteration $t = 1, \dots, T_3$ **do**
26. Generate unconstrained AS (novel structure): $\texttt{AS} \sim \text{LM}_{\theta}(\cdot \mid C)$
27. Validate (syntax/safety); if invalid, **continue**
28. Apply AS conservatively (strong meta-reg): $\theta' \leftarrow \texttt{ApplyStrategy}(\theta, \texttt{AS}, S)$
29. Evaluate and compute reward: $\texttt{Ans} \sim \text{LM}_{\theta'}(\cdot \mid \tau)$; $r \leftarrow r(\texttt{Ans}, \tau)$
30. **if** $r > \text{threshold}_3$ **then**
31. $K \leftarrow K \cup \{\texttt{AS}\}$; $\theta \leftarrow \texttt{RL\_Update}(\theta, r, \texttt{AS})$
32. **else**
33. Penalize harmful proposals in policy update
34. **end if**
35. **end for**
36. **Return:** Refined parameters $\theta^{*}$, enriched KB $K^{*}$
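The Level II step of Algorithm 1, where a chain of edits is applied sequentially and only the final weights are evaluated, can be illustrated with a toy sketch. Everything here is a simplifying assumption: the "weights" are a single float, each edit is a plain Python function, `evaluate` is a hypothetical reward, and the policy update (`RL_Update`) is omitted. The sketch only shows the chain application and the threshold-gated knowledge-base update.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Edit = Callable[[float], float]


def apply_chain(theta: float, chain: Sequence[str], edits: Dict[str, Edit]) -> float:
    """Apply edits e_1..e_k in order: theta_i = edit_i(theta_{i-1})."""
    for name in chain:
        theta = edits[name](theta)
    return theta


def level_two(
    theta: float,
    edits: Dict[str, Edit],
    candidate_chains: List[Sequence[str]],
    evaluate: Callable[[float], float],
    threshold: float,
) -> List[Tuple[str, ...]]:
    """Return the chains whose final weights earn a reward above the threshold."""
    knowledge_base: List[Tuple[str, ...]] = []
    for chain in candidate_chains:
        theta_k = apply_chain(theta, chain, edits)  # only theta_k is evaluated
        reward = evaluate(theta_k)
        if reward > threshold:
            knowledge_base.append(tuple(chain))
            # A full implementation would also update the proposing policy here.
    return knowledge_base


# Toy edits and reward (assumptions): the reward peaks when the weight equals 6.
edits = {"double": lambda x: x * 2, "inc": lambda x: x + 1}
evaluate = lambda theta: -abs(theta - 6.0)

kb = level_two(
    theta=2.0,
    edits=edits,
    candidate_chains=[["double", "inc"], ["inc", "double"]],
    evaluate=evaluate,
    threshold=-0.5,
)
```

The two candidate chains contain the same edits in a different order and reach different final weights, so only one of them passes the threshold. This is why a chain is stored in the knowledge base as an ordered sequence rather than a set of edits.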
## 6 Experiments
### 6.1 Setup
As described in Section [5](https://arxiv.org/html/2605.20189#S5), the base LLM used is Qwen2.5-0.5B-Instruct, the domain is common-sense reasoning, and the evaluation datasets are ARC-c, BoolQ, HellaSwag and ARC-e. Baselines include recent works such as DnD [liang2025drag], Test-Time Learning (TTL) [hu2025testtimelearninglargelanguage], Decoupled and Orthogonal Merging (DOM) [zheng2025decoupleorthogonalizedatafreeframework] (see note 10) and the average of task-specific training LoRAs [hulora]. At one extreme, TTL uses the entire unlabeled corpus of the training LoRAs in addition to the 128 unlabeled examples from the target dataset seen by SOLAR. At the other extreme, instead of using the unlabeled corpus, DOM merges all 7 training LoRAs, inclusive of the target set.

Note 10: DOM is a data-free framework for LoRA merging. It separates parameters into magnitude and direction components and merges them independently, reducing the impact of magnitude differences on the directional alignment of the merged models and thereby helping to preserve task information. It also uses a data-free, layer-wise gradient descent method with orthogonal constraints to mitigate interference when merging the direction components. For evaluation on a target dataset, the LoRAs of the remaining datasets are merged and used.
### 6.2 Hardware
All experiments were conducted on a high-performance computing node running Ubuntu 22.04.1. The processor was an EPYC 8434P with 48 physical cores (96 logical threads) and a maximum clock speed of 2.5 GHz, and the node had 256 GB of system RAM. Four NVIDIA RTX A6000 GPUs, each with 48 GB of dedicated VRAM, were used. The Python version was 3.12.11, and GPU-accelerated tasks were managed using CUDA 12.4.
### 6.3 Results
The major results of this work are presented in Table [1](https://arxiv.org/html/2605.20189#S6.T1), where we conduct experiments on 5 benchmarks in the domain of common-sense reasoning and on 5 out-of-domain benchmarks: GSM-MC and MATH-MC (see note 11) to evaluate mathematical reasoning, DivLogicEval [chung2025divlogiceval] for logical reasoning, SocialIQA [sap2019socialiqa] for reasoning about social interactions, and CodeMMLU [manh2024codemmlu] for reasoning about code-related tasks. Even in its initial version, SOLAR outperforms the task-specific training LoRAs, TTL, DOM and even DnD on average, by 10.4 to 25.2 accuracy points on in-domain tasks and by 7.3 to 11.7 points on out-of-domain tasks, showcasing its potential if further levels of RL training are completed as well (see note 12).

Note 11: GSM-MC and MATH-MC are multiple-choice versions of the standard GSM-8K [cobbe2021training] and MATH [hendrycks2021measuring] datasets. They were selected for two reasons: ease of evaluation and correlation with performance on their subjective counterparts [zhang2024multiple].

Note 12: This might be quite time-intensive, however, with the current version itself taking around 4 days on 2 A6000 GPUs. Only 2 of the 4 GPUs were used because the Qwen model used has 14 attention heads, and the vLLM server used for efficient inference requires this number to be divisible by the number of GPUs, which is only possible if either 2 or 7 are used.
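As a sanity check on how the averages and Δ values in Table 1 relate to the per-dataset accuracies, the short sketch below recomputes them for the in-domain SOLAR and DnD rows. The numbers are copied from the table; the variable names are illustrative. It also shows one plausible reading of why the reported average Δ against DnD is 10.4 rather than 10.5: the gap is consistent with subtracting the unrounded averages, which is an inference and not something the paper states.

```python
# In-domain accuracies (%) copied from Table 1, in the order
# ARC-e, ARC-c, BoolQ, HellaSwag, PIQA.
solar_in = [74.7, 55.5, 58.8, 48.3, 60.1]
dnd_in = [70.9, 48.1, 51.9, 26.5, 47.8]


def mean(values):
    return sum(values) / len(values)


# Per-dataset improvement of SOLAR over DnD, rounded to one decimal place.
deltas = [round(s - d, 1) for s, d in zip(solar_in, dnd_in)]

# Averages as they would be printed in the table (one decimal place).
solar_avg = round(mean(solar_in), 1)
dnd_avg = round(mean(dnd_in), 1)

# Difference of the unrounded averages versus difference of the rounded ones.
delta_from_unrounded = round(mean(solar_in) - mean(dnd_in), 1)
delta_from_rounded = round(solar_avg - dnd_avg, 1)
```

The same recomputation can be repeated for the other baselines and for the out-of-domain columns by swapping in the corresponding rows.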
| Dataset | ARC-e | ARC-c | BoolQ | HellaSwag | PIQA | Avg. In | GSM | MATH | Logic | Social | Code | Avg. Out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 47.4 | 39.7 | 14.7 | 26.3 | 51.5 | 35.9 | 15.6 | 6.8 | 20.3 | 39.5 | 29.8 | 22.4 |
| TTL | 24.4 | 24.7 | 44.4 | 25.9 | 51.9 | 34.3 | 23.5 | 19.7 | 26.2 | 34.9 | 29.7 | 26.8 |
| DOM | 56.5 | 38.9 | 33.2 | 28.3 | 18.8 | 35.1 | 17.7 | 2.6 | 24.7 | 51.3 | 31.6 | 25.6 |
| DnD | 70.9 | 48.1 | 51.9 | 26.5 | 47.8 | 49.0 | 20.8 | 24.1 | 21.0 | 33.5 | 29.1 | 25.7 |
| SOLAR | 74.7 | 55.5 | 58.8 | 48.3 | 60.1 | 59.5 | 30.3 | 24.5 | 25.1 | 55.0 | 35.6 | 34.1 |
| Δ DnD | 3.8 ↑ | 7.4 ↑ | 6.9 ↑ | 21.8 ↑ | 12.3 ↑ | 10.4 ↑ | 9.5 ↑ | 0.4 ↑ | 4.1 ↑ | 21.5 ↑ | 6.5 ↑ | 8.4 ↑ |
| Δ DOM | 18.2 ↑ | 16.6 ↑ | 25.6 ↑ | 20.0 ↑ | 41.3 ↑ | 24.3 ↑ | 12.6 ↑ | 21.9 ↑ | 0.4 ↑ | 3.7 ↑ | 4.0 ↑ | 8.5 ↑ |
| Δ TTL | 50.3 ↑ | 30.8 ↑ | 14.4 ↑ | 22.4 ↑ | 8.2 ↑ | 25.2 ↑ | 6.8 ↑ | 4.8 ↑ | 1.1 ↓ | 20.1 ↑ | 5.9 ↑ | 7.3 ↑ |
| Δ LoRA | 27.3 ↑ | 15.8 ↑ | 44.1 ↑ | 22.0 ↑ | 8.6 ↑ | 23.6 ↑ | 14.7 ↑ | 17.7 ↑ | 4.8 ↑ | 15.5 ↑ | 5.8 ↑ | 11.7 ↑ |

Columns ARC-e through PIQA are the in-domain tasks (averaged in "Avg. In"); columns GSM through Code are the out-of-domain tasks (averaged in "Avg. Out").

Table 1: Accuracy (in %) of SOLAR Level I approach over the baselines TTL (25.2 ↑), LoRA (23.6 ↑), DOM (24.3 ↑), and DnD (10.4 ↑) for in-domain tasks, and TTL (7.3 ↑), LoRA (11.7 ↑), DOM (8.5 ↑), and DnD (8.4 ↑) for out-of-domain tasks, where values in parentheses denote average Δ (change in accuracy) for Qwen2.5-0.5B-Instruct.

Following were the adaptation strategies identified, which enabled SOLAR to reach the accuracy levels presented:
- For ARC-e and PIQA, it was the `TTT` family with configuration `{"ttl_steps": 25, "learning_rate": 1e-5, "batch_size": 4, "shuffle_data": true}`
- For ARC-c and SocialIQA, it was the `LS` family with configuration `{"times": 5, "learning_rate": 0.1}`
- For BoolQ, GSM-MC and MATH-MC, it was the `LoRA` family with the TS-mixing strategy, and the configuration was `{"lambda": 0.5}`
- For HellaSwag, DivLogicEval and CodeMMLU, it was the `TTS` family. For example, for HellaSwag, the corresponding configuration was `{"num_prompt_batches": 20, "method": "max_confidence"}`, indicative of the ensemble approach
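Since the adaptation strategies are emitted by a small model as JSON, a light validation step is a natural companion to the configurations listed above. The sketch below checks a strategy against the field names given in the paper. The flat layout (the `family` key stored next to the configuration fields) and the function name `parse_strategy` are assumptions made for illustration; the paper does not specify the exact schema or how validation is implemented.

```python
import json

# Field names per family, as listed in the text.
REQUIRED_FIELDS = {
    "TTT": {"ttl_steps", "learning_rate", "batch_size", "shuffle_data"},
    "LoRA": {"lambda"},
    "TTS": {"num_prompt_batches", "method"},
    "LS": {"times", "learning_rate"},
}
TTS_METHODS = {
    "avg_sim_score", "avg_prompt_embed",               # router approach
    "max_confidence", "majority_vote", "sum_logprobs",  # ensemble approach
}


def parse_strategy(text: str) -> dict:
    """Parse a JSON adaptation strategy and raise ValueError if it is malformed."""
    obj = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict):
        raise ValueError("strategy must be a JSON object")
    family = obj.get("family")
    if family not in REQUIRED_FIELDS:
        raise ValueError(f"unknown family: {family!r}")
    missing = REQUIRED_FIELDS[family] - set(obj)
    if missing:
        raise ValueError(f"missing fields for {family}: {sorted(missing)}")
    if family == "TTS" and obj["method"] not in TTS_METHODS:
        raise ValueError(f"unknown TTS method: {obj['method']!r}")
    return obj


# The HellaSwag configuration reported above, in the assumed flat layout.
hellaswag = parse_strategy(
    '{"family": "TTS", "num_prompt_batches": 20, "method": "max_confidence"}'
)
```

A check of this kind only covers syntax and field presence; it says nothing about whether the configured values are sensible for a given dataset.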
### 6.4 Ablation Study
A primary effect we would like to isolate and study is that of the initial prompt batch provided to start the LLM adaptation process using SOLAR. Ideally, SOLAR's performance should change little even when a highly representative, diverse and influential prompt batch is used instead. For this purpose, inspired by [tang2025enhancingcrosstasktransferlarge], we use the following strategy for prompt filtering and selection (see Figure [5](https://arxiv.org/html/2605.20189#S6.F5)).

We first model inter-prompt relations as a directed graph $\mathcal{G} = (\mathbf{V}, \mathbf{E}, \mathbf{P})$, wherein each prompt is encoded as a vector using Sentence-BERT. Each vertex $v_i \in \mathbf{V}$ denotes a prompt (sample), a directed edge $e(i,j) \in \mathbf{E}$ connects $v_i$ to its neighbor $v_j$, and the weight $p(i,j) \in \mathbf{P}$ is the cosine similarity of their embeddings. For each node $v_i$, an average similarity $s_i$ and a neighbor count $k_i$ are computed as shown below, so that nodes with higher average similarity make more connections.
$$s_i = \frac{1}{|\mathbf{V}|-1}\sum_{j\neq i} s(i,j), \quad k_i = \left\lceil \alpha \cdot s_i \cdot (|\mathbf{V}|-1) \right\rceil$$
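The two quantities in this equation can be computed directly from the prompt embeddings. The sketch below is a minimal pure-Python illustration that assumes $s(i,j)$ is the cosine similarity between the embeddings of prompts $i$ and $j$ (the same measure used for the edge weights) and that the embeddings are available as lists of floats. The function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adaptive_neighbors(
    embeddings: List[Sequence[float]], alpha: float
) -> List[Tuple[float, int]]:
    """Return (s_i, k_i) for every prompt, following the equation above."""
    n = len(embeddings)
    result = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        s_i = sum(sims) / (n - 1)                 # average similarity to all others
        k_i = math.ceil(alpha * s_i * (n - 1))    # number of outgoing connections
        result.append((s_i, k_i))
    return result


# Two identical prompts and one orthogonal prompt (toy embeddings).
stats = adaptive_neighbors([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], alpha=1.0)
```

Here the two similar prompts each receive one connection and the unrelated prompt receives none. Note that a negative average cosine similarity would yield a non-positive $k_i$ under this formula, so a practical implementation would probably need to clamp it.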
Samples are then scored by (1) influence and (2) diversity. The influence score $I(v)$ is obtained by a diffusion simulation (the simulation is run 20 times and the results are averaged to obtain the final value). For this, first initialize an active set $S_{\rm active} = \{v\}$, then iteratively sample an active node $u$ and attempt to activate each neighbor $w \in N_1(u)$ with probability $p(u,w)$. Newly activated nodes join $S_{\rm active}$. This process is repeated until no active nodes remain, and $I(v)$ is the total number of visited nodes. The diversity penalty $D(v)$ measures overlap with already selected nodes:
$$D(v) = -\sum_{i=1}^{k} \beta^{i}\,\bigl|N_i(v)\cap S_{\rm selected}\bigr|, \quad f_{\mathcal{G}}(v) = I(v) + \gamma\,D(v)$$
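The influence and diversity terms can be sketched as follows. Several details are assumptions made for illustration: the graph is a dictionary mapping each node to its out-neighbors and activation probabilities, a node that has already been visited is never re-activated, and $N_i(v)$ is read as the set of nodes at exactly $i$ hops from $v$. The 20 simulation runs are the value stated in the text.

```python
import random
from typing import Dict, List, Set

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: activation probability}


def simulate_once(graph: Graph, seed_node: str, rng: random.Random) -> int:
    """One diffusion run; returns the number of visited nodes."""
    visited = {seed_node}
    active = [seed_node]
    while active:
        u = active.pop()
        for w, p in graph.get(u, {}).items():
            if w not in visited and rng.random() < p:
                visited.add(w)
                active.append(w)  # newly activated nodes join the active set
    return len(visited)


def influence(graph: Graph, v: str, runs: int = 20, seed: int = 0) -> float:
    """Average number of visited nodes over several simulation runs."""
    rng = random.Random(seed)
    return sum(simulate_once(graph, v, rng) for _ in range(runs)) / runs


def hop_layers(graph: Graph, v: str, max_hops: int) -> List[Set[str]]:
    """Nodes at exactly 1, 2, ..., max_hops hops from v."""
    seen, frontier, layers = {v}, {v}, []
    for _ in range(max_hops):
        frontier = {w for u in frontier for w in graph.get(u, {}) if w not in seen}
        layers.append(frontier)
        seen |= frontier
    return layers


def diversity_penalty(graph: Graph, v: str, selected: Set[str],
                      beta: float, max_hops: int) -> float:
    """D(v): negative, hop-discounted overlap with already selected nodes."""
    layers = hop_layers(graph, v, max_hops)
    return -sum(beta ** i * len(layer & selected)
                for i, layer in enumerate(layers, start=1))


chain_graph = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {}}
influence_a = influence(chain_graph, "a")
penalty_a = diversity_penalty(chain_graph, "a", {"b", "c"}, beta=0.5, max_hops=2)
```

With activation probabilities of 1.0 the diffusion from `a` always reaches all three nodes, and the penalty discounts the two-hop overlap with `c` more heavily than the one-hop overlap with `b`.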
Finally, a greedy graph search is performed to select the final prompt subset $S$. Start with $S = \emptyset$ and at each round pick

$$v^{*} = \arg\max_{v \notin S} f_{\mathcal{G}}(v).$$

$v^{*}$ is then added to $S$, and the diversity penalties are updated only for the neighbors of $v^{*}$ (the influence scores are precomputed). This process continues until $|S|$ reaches the target size, which in our case is 128.
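The greedy selection can be sketched as below. To keep the example short it makes two simplifying assumptions that are not from the paper: only one-hop neighborhoods contribute to the penalty ($k = 1$), and the neighbor relation is symmetric, so selecting $v^{*}$ lowers the score of each of its neighbors by $\gamma\beta$. The influence scores are passed in precomputed, as in the text.

```python
from typing import Dict, List, Sequence


def greedy_select(
    influence: Dict[str, float],
    neighbors: Dict[str, Sequence[str]],
    target_size: int,
    gamma: float = 1.0,
    beta: float = 0.5,
) -> List[str]:
    """Greedily pick nodes maximizing f(v) = I(v) + gamma * D(v)."""
    penalty = {v: 0.0 for v in influence}  # D(v), starts at zero
    selected: List[str] = []
    chosen = set()
    while len(selected) < min(target_size, len(influence)):
        best = max(
            (v for v in influence if v not in chosen),
            key=lambda v: influence[v] + gamma * penalty[v],
        )
        selected.append(best)
        chosen.add(best)
        # Only the neighbors of the newly selected node need a penalty update.
        for w in neighbors.get(best, ()):
            penalty[w] -= beta
    return selected


influence_scores = {"a": 3.0, "b": 2.9, "c": 2.0}
neighbor_map = {"a": ["b"], "b": ["a"], "c": []}
subset = greedy_select(influence_scores, neighbor_map, target_size=2, beta=1.5)
```

In this toy graph `b` has the second-highest influence but is adjacent to the already selected `a`, so after the penalty update the less influential but non-overlapping `c` is chosen instead. Updating only the neighbors of $v^{*}$ keeps each round cheap, since all other scores are unchanged.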
Figure 5: Details of the Prompt Selection Strategy used in Ablation Study

Fortunately, the influence of the initial prompt batch was marginal (with just a 0.3% improvement in accuracy when averaged across all evaluation datasets), indicating that SOLAR can efficiently adapt LLMs to unseen datasets without requiring a high-quality or manually curated dataset. A handful of unlabeled prompt instances that are merely indicative of the task suffice.
## 7 Conclusion
In this paper, we introduce SOLAR, a novel paradigm for streaming and continual learning that empowers LLMs to autonomously discover and retain parameter-level adaptation strategies. By bridging the gap between rapid test-time adaptation (plasticity) and long-term meta-knowledge retention (stability), SOLAR addresses the core challenges of deploying agents in non-stationary environments. While currently reliant on a seed knowledge base, the framework lays the groundwork for fully autonomous, self-evolving systems capable of navigating the open-ended drifts of the real world. Another key tradeoff is that of real-time adaptation versus computation. While SOLAR's training phase is compute-intensive, the inference-time application of learned strategies is rapid. By pre-compiling complex adaptation routines into the knowledge base, SOLAR shifts the computational burden from the streaming phase to the offline meta-learning phase. This allows the agent to react to concept drift in near real-time by simply retrieving and applying a cached strategy, rather than performing expensive gradient descent from scratch every time.
## 8 Acknowledgments
The authors would like to thank Professor Sashikumaar Ganesan, from the Department of Computational and Data Science at Indian Institute of Science, Bangalore for feedback and additional compute resources required to execute this project\.
## Declaration on Generative AI
During the preparation of this work, the authors used Large Language Models \(GPT\-5\.2, Claude Opus 4\.5 and Gemini\-3\) as a writing assistant tool for drafting content, to generate literature review, for abstract drafting, to paraphrase and reword, to improve writing style, for grammar and spelling check as well as to generate the images used in the paper\. The process was interactive\. After writing the core content, the authors used LLMs with specific prompts to refine the text\. These prompts included requests to “check for grammatical errors,” “rephrase this sentence for clarity,” “make this paragraph more concise,” or “suggest alternative phrasing to improve flow\.” The LLMs were not used to generate any scientific ideas, experimental results, data analysis or other core intellectual contributions of the paper\. After using these tool\(s\)/service\(s\), the authors reviewed and edited the content as needed and take full responsibility for the publication’s content\.
## References