Tag
ActiveGraph announces two new papers on agent memory (LongMemEval) and self-improvement regimes, along with reference agents, pack templates, and upcoming meetups in Seattle and San Francisco.
Introduces Test-Time Reinforcement Learning (TTRL), a method that applies majority voting to a model's own answers on unlabeled data to create pseudo-labels for RL training, enabling LLMs to self-improve without ground-truth answers. Reports significant gains (e.g., roughly 159% to 211% on AIME 2024 for Qwen-2.5-Math-7B).
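The majority-voting step in the summary above can be illustrated with a minimal sketch. It assumes the sampled answers have already been extracted as strings and that the reward is a simple 0/1 match against the voted label. The function name `majority_vote_rewards` and the sample data are hypothetical, and the RL update itself is omitted.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for the answers sampled for one prompt.

    Assumption: the most frequent answer is treated as the pseudo-label, and
    each sample is rewarded 1.0 if it matches that label, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns a list holding the (answer, count) pair with the top count.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical answers sampled from the model for a single unlabeled question.
samples = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

The rewards would then feed whatever RL objective is in use; no ground-truth answer is consulted at any point.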
Speculation holds that Anthropic's new model, Mythos, finished training in February this year and has since quietly changed the pace of R&D, contributing to a significant leap in AI capabilities over the past five months. Leading models are now helping to train the next generation of models.
The paper proposes Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework that uses skill-guided policies to generate supervision for off-trajectory states during closed-loop execution, improving GUI agent success rates on OSWorld-Verified from around 30% to over 50%.
ENPIRE is a framework that enables autonomous robot policy self-improvement in the real world through a closed-loop system of environment feedback, policy refinement, and evolutionary code optimization, achieving 99% success on dexterous manipulation tasks.
This article lists 25 abilities worth long-term training for ordinary people over the next ten years, including personal branding (personal IP), applying AI, sales, and self-media, emphasizing the accumulation of core abilities rather than chasing trends.
NVIDIA GEAR lab introduces ENPIRE, a framework for autonomous real-world robot policy self-improvement that achieves 99% success on dexterous manipulation tasks like GPU insertion and zip-tying, with multi-robot parallel learning and open-source release.
This tweet describes a four-layer compound stack for AI agent systems: a bottom layer of primitives (Fable 5, sub-agents, worktree), an orchestration layer (goal loops, dynamic workflows, cloud Routines), a memory layer (state files, Skills, knowledge bases), and a top self-improvement layer (visual self-inspection, evaluation loops, rule distillation).
APEX proposes a three-layer self-evolution framework for production AI agents that simultaneously optimizes the harness, behavioural principles, and workflow topology. Experiments on a production agent show significant improvements in health score and workflow quality with minimal LLM calls.
Vadim Fedenko shares a technical analysis of Recursive Self-Improvement (RSI), arguing that true RSI requires improving capability faster than complexity and expanding architectural space rather than just optimizing within fixed parameters. He doubts recent claims by xAI and Anthropic that RSI could arrive within a year, citing LLMs' poor subtractive engineering skills and current reward functions that ignore complexity.
This article summarizes eight essential skills for conducting research, including topic selection, judgment, input, record-keeping, rapid trial and error, attention to detail, cross-disciplinary collaboration, and seeking feedback, emphasizing that research ability is a long-term cumulative process.
SIFT is a product that helps users break hidden habits holding them back.
Introduces Write Gate for Hermes Agent, allowing users to approve or deny memory and skill updates, enhancing control and security for AI agent self-improvement.
The author showcases a controlled self-improvement approach for AI agents, built on activegraph, that uses a regime-to-seam method in which failures are categorized so that fixes target specific areas.
Anthropic's paper explores scenarios where AI systems autonomously build or improve themselves, discussing implications for safety and alignment.
This tweet thread introduces research showing that training models to verify their own work can nearly double accuracy on hard math problems and improve scientific reasoning by 14x.
This article summarizes an in-depth discussion among three Google DeepMind researchers on reasoning, multimodal generation (Omni), coding, and self-improvement. It emphasizes that visual and dynamic thinking will surpass text-based chain-of-thought and explores future trends in world models and synthetic training cases.
This paper introduces the Meta-Agent Challenge (MAC), a benchmark for evaluating AI models' ability to autonomously develop agent systems through iterative programming. Results show that current models rarely match human baselines and exhibit issues like reward hacking, highlighting gaps in self-improvement capabilities.
The author explores building an AI agent system called SPINE that can develop and improve itself using local inference models, focusing on deterministic workflows and legibility to allow modest models to operate reliably.
This paper introduces a 'Sleep' paradigm for large language models that enables continual learning through memory consolidation and dreaming phases, allowing models to distill short-term knowledge into long-term parameters and self-improve without human supervision.