Introducing Apodex, a self-evolving heavy-duty solver that uses a verification-centric agent team architecture for in-depth research. It supports self-solving, evidence chain verification, and more. It is currently in early access and completely free.
The tweet argues that packaging personal workflows (decomposition methods, verification rules, output formats, and so on) into reusable Skills creates a self-evolving Compounding Loop that aligns with cybernetics principles and is a key long-term capability.
Apodex 1.0 is a self-evolving AI system post-trained on Qwen3.5, achieving SOTA on BrowseComp, DeepSearchQA, and HLE-text. Its 4B mini model outperforms 30B-class models, and an AgentOS runtime handles task orchestration. Open weights are available.
HarnessX introduces a framework for self-evolving AI agent harnesses that treats the runtime harness as a first-class object, enabling automatic adaptation via trace-driven reinforcement learning. It achieves average gains of +14.5% across five benchmarks, with larger improvements for weaker models.
Microsoft introduces SkillOpt, a method that trains an agent's skill documentation like a neural network, using epochs, batches, learning rates, and validation sets for optimization, without modifying model weights. It achieves top results across multiple benchmarks and can be transferred across models and tools.
TabClaw is an open-source interactive AI agent for spreadsheet manipulation and table reasoning that uses LLMs to automate data analysis, support multi-table reasoning, and adapt to user preferences through memory and skill extraction.
Memento-Skills is a self-evolving agent framework where agents learn from failures and rewrite their own skills, improving over time through a Read-Execute-Reflect-Write loop. It was tested on HLE and GAIA benchmarks and supports open-source LLMs like Kimi, MiniMax, and GLM.
This paper introduces SkeMex, a self-evolving framework that enhances medical agents by distilling interaction trajectories into structured skill memory, enabling better long-term clinical reasoning through context-dependent utility estimation and governance.
Skill-3D is a framework that enables AI agents to learn scene-aware skills through self-evolving memory and skill libraries, significantly improving tool utilization in 3D spatial reasoning tasks (e.g., from 39% to 78% on VSI-Bench).
SePO (Self-Evolving Prompt Optimization) proposes a self-referential prompt agent that optimizes both task agents' system prompts and its own system prompt through an evolutionary search, outperforming Manual-CoT, TextGrad, and MetaSPO across five benchmarks including AIME'25, ARC-AGI-1, and GPQA.
Parthenon is a self-evolving legal-agent framework that structures LLM agents into six auditable layers and uses an anti-leakage learning loop to improve performance on end-to-end legal matters without modifying model weights. A large-scale empirical study on Harvey LAB with 12,510 agent trajectories shows current frontier agents still struggle with strict matter completion, and Parthenon substantially improves results over state-of-the-art baselines.
MLEvolve is a self-evolving LLM-based multi-agent framework for automated ML algorithm discovery that extends tree search to Progressive MCGS with graph-based cross-branch information flow and retrospective memory. It achieves state-of-the-art performance on MLE-Bench and outperforms AlphaEvolve on mathematical algorithm optimization tasks.
SkillDAG is a self-evolving typed directed graph for LLM skill selection at scale. It models inter-skill relationships and lets agents query and evolve the graph during execution, outperforming baselines on ALFWorld and SkillsBench.
This paper presents Traj-Evolve, a self-evolving multi-agent system that uses an experience pool and multi-agent reinforcement learning to model patient trajectories from longitudinal EHRs for lung cancer early detection, outperforming strong baselines.
This post discusses a paper arguing that, in self-evolving agent systems, updating the harness (writing useful updates) and benefiting from those updates (actually applying them in subsequent tasks) are two distinct abilities. The latter is the key one, and weak models often fail to use the rules.
EvoDS is a self-evolving autonomous data science agent that improves via reinforcement learning-driven skill acquisition and adaptive context compression, outperforming open-source agents by 28.9% on benchmarks.
This paper introduces GrowLoop, a self-evolving evaluation system for assessing human-likeness in open-ended conversations. It uses minimal human seed annotations to iteratively refine evaluation rubrics, addressing challenges of tacit knowledge, varying human agreement, and evolving model capabilities.
This paper introduces CUDAnalyst, a tool for analyzing how individual feedback signals influence planning decisions in self-evolving LLM agents for CUDA kernel generation, using trajectory freezing and selective feedback injection to enable controlled attribution.
SkillOpt introduces a systematic, controllable text-space optimizer that lets AI agents train and improve their own skills (akin to "work instructions") through iterative edits and validation, outperforming human-crafted and one-shot prompts across multiple benchmarks and models.
Microsoft Research introduces SkillOpt, a method that treats agent skill documents as trainable external state, using an optimizer model to make bounded edits validated by a held-out set. The approach achieves best or tied results across 52 evaluation cells and improves accuracy by over 23 points on GPT-5.5, with zero extra inference cost and transferable skills.
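The idea of bounded edits gated by a held-out set can be illustrated with a toy sketch. This is not SkillOpt's implementation. It assumes a skill document is a list of instruction lines, replaces the optimizer model with a random proposer, and uses a made-up scoring function. All names here (`propose_edit`, `optimize_skill`, `toy_val_score`, `POOL`, `USEFUL`) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Skill = List[str]  # assumption: a skill document is a list of instruction lines


def propose_edit(skill: Skill, pool: List[str], rng: random.Random) -> Skill:
    """Return a copy of `skill` with exactly one line removed or one line added.

    Stand-in for an optimizer model; the one-line limit is the "bounded edit".
    """
    edited = list(skill)
    if edited and rng.random() < 0.5:
        edited.pop(rng.randrange(len(edited)))
    else:
        edited.append(rng.choice(pool))
    return edited


def optimize_skill(
    skill: Skill,
    pool: List[str],
    val_score: Callable[[Skill], float],
    steps: int = 50,
    seed: int = 0,
) -> Tuple[Skill, float]:
    """Keep an edit only if it strictly improves the held-out validation score."""
    rng = random.Random(seed)
    best, best_score = list(skill), val_score(skill)
    for _ in range(steps):
        candidate = propose_edit(best, pool, rng)
        score = val_score(candidate)
        if score > best_score:  # validation gate: reject edits that do not help
            best, best_score = candidate, score
    return best, best_score


# Hypothetical validation signal: reward distinct useful lines, penalize clutter.
USEFUL = {"cite sources", "check units"}
POOL = ["cite sources", "check units", "be verbose"]


def toy_val_score(skill: Skill) -> float:
    helpful = len(USEFUL & set(skill))                        # distinct useful lines
    clutter = sum(1 for line in skill if line not in USEFUL)  # lines outside USEFUL
    return helpful - 0.5 * clutter


final_skill, final_score = optimize_skill(["be verbose"], POOL, toy_val_score)
```

Because an edit is accepted only when the validation score strictly rises, the returned score can never fall below the starting one. Model weights are never touched; only the text of the skill changes.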