The author introduces Tiro, an open-source agentic memory and retrieval framework designed to solve long-term context drift in LLM agents by providing modular, inspectable memory lanes for sessions, documents, and operational state.
This research blog post demonstrates that repeatedly rewriting LLM agent experiences into textual 'lessons' often degrades performance rather than improving it. The author finds that episodic memory retention performs better than abstract consolidation across various benchmarks like ARC-AGI and ALFWorld.
The Perplexity team has published guidelines for designing, iterating on, and maintaining Agent Skills, emphasizing that writing Skills is not traditional coding but constructing context for the model. The article advocates a counter-intuitive methodology centered on evaluation-first development, progressive loading, and documenting edge cases ("gotchas") to steer agent behavior.
The article outlines a method for creating a personalized automated content engine using a single Markdown file for data and an HTML dashboard powered by Claude agents to replace paid SaaS tools.
RAO (Recursive Agent Optimization) is an end-to-end reinforcement learning approach for training LLM agents to spawn, delegate to, and coordinate with recursive copies of themselves, turning recursive inference into a learned capability.
A GitHub repository containing 30 runnable Jupyter notebooks that comprehensively explain LLM agent memory technologies, from short-term context to production-grade patterns, covering methods like MemGPT, Zep, and Graphiti, along with decision trees and comparison tables.
This paper introduces SkillRet, a large-scale benchmark for evaluating skill retrieval in LLM agents, addressing the challenge of selecting relevant skills from large libraries. It provides a dataset of over 17,000 skills and demonstrates that task-specific fine-tuning significantly improves retrieval performance.
This paper challenges the assumption that adding more scaffolding components to LLM agents always improves performance, demonstrating through systematic experiments that cross-component interference often leads to degradation. The study finds that simpler, task-specific subsets of components frequently outperform fully equipped 'all-in' agents across various model scales.
This paper introduces BeliefMem, a novel memory paradigm for LLM agents that stores multiple candidate conclusions with probabilities to handle partial observability and reduce self-reinforcing errors. Empirical evaluations show it outperforms deterministic baselines on LoCoMo and ALFWorld benchmarks.
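The core idea of keeping several weighted hypotheses alive instead of committing to a single stored conclusion can be sketched roughly as follows (the class name, update rule, and example strings are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass, field


@dataclass
class BeliefEntry:
    """One memory slot holding competing candidate conclusions with probabilities."""
    candidates: dict[str, float] = field(default_factory=dict)

    def observe(self, candidate: str, likelihood: float) -> None:
        # Reweight the candidate by the evidence likelihood, then renormalize,
        # so no single early conclusion can lock in irreversibly.
        self.candidates[candidate] = self.candidates.get(candidate, 1.0) * likelihood
        total = sum(self.candidates.values())
        self.candidates = {c: p / total for c, p in self.candidates.items()}

    def best(self) -> tuple[str, float]:
        # The agent acts on the most probable conclusion but keeps the rest.
        return max(self.candidates.items(), key=lambda kv: kv[1])


entry = BeliefEntry()
entry.observe("keys are in the drawer", 0.6)
entry.observe("keys are on the table", 0.4)
entry.observe("keys are in the drawer", 0.9)  # new supporting evidence
print(entry.best())  # the drawer hypothesis now leads, but the alternative survives
```

Under partial observability, a weaker hypothesis can later overtake the leader as evidence accumulates, which is what a deterministic single-conclusion memory cannot do.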
This paper presents an agentic system using Large Language Models to automate the discovery of exchange-correlation functionals in Density Functional Theory, achieving improvements over human-designed baselines while highlighting challenges with benchmark overfitting.
This paper introduces 'constant-context skill learning,' a framework that moves procedural knowledge from prompts into model weights to reduce token usage and improve privacy for LLM agents. The method achieves strong performance on benchmarks like ALFWorld and WebShop while significantly reducing inference costs.
The article introduces MANTRA, a framework for automatically synthesizing SMT-validated compliance benchmarks for tool-using LLM agents from natural language manuals. It demonstrates that this approach enables scalable and reliable evaluation of agent adherence to complex procedural rules.
StraTA proposes strategic trajectory abstraction for long-horizon LLM agents, using hierarchical GRPO-style rollout with diverse strategy sampling and critical self-judgment to improve sample efficiency and final performance over frontier models and prior RL baselines.
Skill1 is a unified framework that trains a single policy to co-evolve skill selection, utilization, and distillation using a shared task-outcome objective. Experiments on ALFWorld and WebShop show it outperforms existing baselines in complex task environments.
This paper introduces SkillOS, a reinforcement learning framework that enables LLM agents to learn long-term skill curation policies for self-evolution, improving performance and generalization across tasks.
Simon Willison reflects on how vibe coding and agentic engineering are converging in his own workflow, raising concerns about code review responsibilities as AI coding agents like Claude Code become increasingly reliable. He explores the ethical tension between trusting AI-generated code in production and maintaining software engineering standards.
ARIS is an open-source research harness that uses cross-model adversarial collaboration to improve the reliability of long-term research outcomes, organized into coordinated execution, orchestration, and assurance layers.
The paper introduces Direct Corpus Interaction (DCI), a novel approach allowing AI agents to query raw text directly using standard terminal tools instead of traditional embedding-based retrieval. By bypassing fixed similarity interfaces and offline indexing, DCI significantly outperforms conventional sparse, dense, and reranking baselines across multiple IR and agentic search benchmarks.
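As a rough illustration of the idea (not the paper's actual harness), an agent tool can shell out to `grep` over raw files instead of querying a pre-built vector index; the function name and corpus layout here are hypothetical:

```python
import pathlib
import subprocess
import tempfile


def grep_corpus(pattern: str, corpus_dir: str, max_hits: int = 5) -> list[str]:
    """Search raw text with grep instead of embeddings: no offline indexing,
    and the agent can reformulate the pattern on every call."""
    proc = subprocess.run(
        ["grep", "-rni", pattern, corpus_dir],  # recursive, line numbers, case-insensitive
        capture_output=True,
        text=True,
    )
    # Each hit comes back as "file:line:text", which the agent reads directly.
    return proc.stdout.splitlines()[:max_hits]


# Tiny demo corpus in a temporary directory.
corpus = tempfile.mkdtemp()
pathlib.Path(corpus, "notes.txt").write_text("Episodic memory beats consolidation.\n")
hits = grep_corpus("episodic", corpus)
```

Because the query interface is just a terminal command, the agent can iterate with regexes, anchors, or context flags rather than being limited to a fixed similarity function.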
A large-scale study finds that LLM agents can predict individual social-media reactions with 70.7% accuracy but still lag behind simple TF-IDF classifiers, highlighting both manipulation risks and policy-simulation utility.
An open-source, local-first memory layer for LLM agents on macOS that captures user activity and saves it as Markdown files.
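A minimal sketch of the append-to-Markdown pattern such a memory layer might use (the file layout and function name are assumptions, not the project's actual API):

```python
import tempfile
from datetime import datetime
from pathlib import Path


def log_activity(event: str, memory_dir: Path) -> Path:
    """Append one observed activity as a timestamped Markdown bullet,
    one file per day, so the memory stays greppable and human-editable."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    path = memory_dir / f"{now:%Y-%m-%d}.md"
    if not path.exists():
        path.write_text(f"# Activity log {now:%Y-%m-%d}\n\n")
    with path.open("a") as f:
        f.write(f"- {now:%H:%M} {event}\n")
    return path


# Demo: two captured events land in the same daily file.
memory = Path(tempfile.mkdtemp())
log_path = log_activity("opened project README in editor", memory)
log_activity("ran test suite (all green)", memory)
```

Storing memory as plain Markdown keeps it local, diffable, and directly readable by both the user and any agent with file access.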