Tag
An LLM was given access to a thermal camera pointed at the Raspberry Pi it runs on, and it began running experiments, toggling the fan to observe the resulting temperature changes.
This article explores how AI agents can automatically write and optimize their skill files using techniques like SkillOpt from Microsoft Research, which treats skill documents as trainable state and delivers significant performance improvements. It addresses the challenge of manual skill tuning and presents frameworks like GEPA and EvoSkill as evolutionary approaches.
TokenArch Lanterns is a framework for exploring and developing standards for autonomous agents.
A detailed thread explaining the high-level architecture of the SWE AI coding agent, showing how a GitHub issue flows through ingestion, an orchestrator, model gateway, tools, code intelligence, sandbox environment, PR builder, guardrails, and observability to autonomously produce a pull request.
The article explains the shift from manually prompting coding agents to designing automated loops that prompt them, detailing what these loops are, their historical evolution, and the components needed to build them in production.
Atomic Mail Agentic enables AI agents to autonomously read, send, and react to emails, streamlining email management and automation.
The article questions why most autonomous agents are developed for business use rather than for individual users, pointing out a gap in AI accessibility.
This paper introduces BeliefDiffusion, a framework that combines diffusion models, used to represent multimodal belief distributions, with Model Predictive Control for planning in partially observable environments. It achieves higher navigation success and path efficiency than baselines.
Researchers at Nvidia GEAR Lab achieved a milestone in which 8 Codex-AutoResearch agents autonomously controlled a robot fleet to solve a physical-world task without human intervention, demonstrating self-improvement.
A guide to setting up a local AI agent framework using iPhone, Mac Mini M4, and Claude Opus 4.8, allowing autonomous agents to run 24/7 at home, handle tasks, and improve over time.
This paper introduces Base Sequence Analysis, a framework that encodes LLM agent runtime behavior into compact sequences, revealing high-risk patterns like the 'P-X-P' trigram and a verification deficit. It presents Governor, a runtime intervention system that improves task success by 6.2% and reduces token consumption by 44%.
The paper proposes the Minimum Sufficient Oversight Principle (MSO) for governing delegated AI systems, deriving mathematical solutions for autonomy allocation and trust calibration, and introduces concepts like water-filling allocation and masking pathology.
This paper introduces Simmer, a benchmark for evaluating latent failures in LLM-generated executable plans using a human-curated symbolic world model in the kitchen domain. In the experiments, at most 17% of the plans produced by frontier LLMs are error-free and up to 56% contain latent failures; counterfactual foresight simulation significantly reduces these failures.
A recorded discussion about effectively running autonomous long-running coding agents, including insights on goal setting, model selection, and best practices, made freely available.
This article discusses the key requirements for AI agents to successfully complete real-world tasks: a real phone number, email address, and payment method, highlighting products like AgentLine, Agent Mail, and Agent Card that provide these capabilities.
A developer created a zero-player civilization game where LLM agents autonomously farm, reproduce, build, and wage wars, driven by Maslow's hierarchy of needs, with emergent religious conflicts and societal collapses.
Arbor introduces structured tree search as a cognition layer for autonomous agents, enabling multi-day, full-stack LLM inference optimization with up to 193% throughput-latency improvement over vendor baselines through a checks-and-balances multi-agent architecture.
This article contrasts two AI usage patterns: reactive search vs. autonomous agents, arguing that real efficiency gains come from delegating multi-step tasks to AI tools like OpenClaw. It notes that while most people stick with the simpler prompt-response loop, moving to agent-based workflows requires clear goal setting.
A comparison between reactive AI usage and autonomous agents, highlighting significant time savings when using agents like OpenClaw for email and research tasks.
Elvis Sun shares a detailed playbook on using AI coding agents with harness engineering and loss function development to autonomously solve complex engineering problems, demonstrating how to avoid common pitfalls like agent cheating.