Tag
Max Lamparth discusses the evolution of AI, the misconception that AI is a single monolithic technology, where AI creates real value (e.g., code, math, drug discovery) as opposed to hype, and the importance of trust and reliability in high-stakes settings.
Anthropic released an 11-page paper titled 'Loop Design: The Anthropic Playbook for Agentic Systems', arguing that independent verifiers are more critical than prompts in agent design.
A senior Google engineer released a 19-page PDF on 'Loop Engineering' for LLM and agentic systems, outlining an iterative feedback loop where the LLM proposes code transformations, observes compiler feedback, learns from it, and repeats until improvements stop.
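The propose, observe, repeat loop described in this summary can be sketched roughly as follows. This is a minimal illustration, not the PDF's method: `optimize_loop`, `propose`, and `evaluate` are hypothetical names, and the toy stand-ins replace the LLM and the compiler with simple functions so the loop can run on its own.

```python
from typing import Callable, List, Tuple


def optimize_loop(
    candidate: int,
    propose: Callable[[int, str], int],
    evaluate: Callable[[int], Tuple[float, str]],
    max_iters: int = 20,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Greedy loop: propose a change, observe feedback, stop when a round does not improve."""
    best = candidate
    best_score, feedback = evaluate(best)
    history: List[Tuple[int, float]] = []
    for _ in range(max_iters):
        proposal = propose(best, feedback)      # stand-in for the LLM proposing a transformation
        score, feedback = evaluate(proposal)    # stand-in for compiling and measuring
        history.append((proposal, score))
        if score <= best_score:                 # no improvement: stop iterating
            break
        best, best_score = proposal, score
    return best, best_score, history


# Toy stand-ins (assumptions): the "program" is an integer unroll factor,
# and the score peaks at a factor of 4.
def toy_propose(current: int, feedback: str) -> int:
    return current + 1


def toy_evaluate(factor: int) -> Tuple[float, str]:
    score = -float((factor - 4) ** 2)
    return score, f"factor={factor} score={score}"


best, best_score, history = optimize_loop(1, toy_propose, toy_evaluate)
```

In a real system the proposal step would call a model and the evaluation step would invoke a compiler or benchmark harness; the stopping rule here (first non-improving round) is only one possible reading of "until improvements stop".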
PseudoBench is a benchmark to evaluate whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Testing seven state-of-the-art agents reveals they readily produce persuasive pseudoscientific reports with near-zero refusal rates, calling for scientific alignment before deployment.
This paper proposes offline preference-based trajectory evaluation for agentic systems, which compares trajectories via temporal preferences rather than binary success metrics. It shows that this approach reduces ties from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.
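To illustrate why a finer-grained preference can produce fewer ties than a binary success metric, here is a small sketch. The preference rule used (among successful trajectories, prefer the one that succeeds in fewer steps) is an assumption chosen for illustration, not the paper's definition of temporal preference, and the `Trajectory` type and toy data are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Trajectory:
    name: str
    success_step: Optional[int]  # step at which the goal was reached; None if never


def binary_compare(a: Trajectory, b: Trajectory) -> int:
    """Return 1 if a is better, -1 if b is better, 0 for a tie (success/failure only)."""
    sa = a.success_step is not None
    sb = b.success_step is not None
    return int(sa) - int(sb)


def temporal_compare(a: Trajectory, b: Trajectory) -> int:
    """Assumed rule: compare on success first, then prefer the earlier success."""
    first = binary_compare(a, b)
    if first != 0 or a.success_step is None:
        return first  # outcomes differ, or both failed (still a tie)
    return int(a.success_step < b.success_step) - int(a.success_step > b.success_step)


def tie_rate(trajs: List[Trajectory], compare: Callable[[Trajectory, Trajectory], int]) -> float:
    """Fraction of unordered trajectory pairs that the comparison cannot separate."""
    pairs = list(combinations(trajs, 2))
    return sum(compare(a, b) == 0 for a, b in pairs) / len(pairs)


# Toy data (made up for this example)
trajs = [
    Trajectory("a", 3),
    Trajectory("b", 7),
    Trajectory("c", 7),
    Trajectory("d", None),
    Trajectory("e", None),
]

binary_ties = tie_rate(trajs, binary_compare)
temporal_ties = tie_rate(trajs, temporal_compare)
```

On this toy set the binary metric ties every pair with the same outcome, while the assumed temporal rule separates successful trajectories that finish at different steps. The numbers are properties of the toy data only and say nothing about the paper's reported results.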
Introduces Playful Agentic Robot Learning, where embodied coding agents use self-directed play to learn reusable skills, improving downstream task performance without additional training. The proposed RATs system achieves significant gains over baselines in simulation and real-world transfer.
A paper presenting The AI Scientist, a system that automates the entire research lifecycle from idea generation to peer review, demonstrating AI's growing capacity for scientific contribution.
A tweet discussing two agentic GPU kernel optimization systems: Auto GPU Kernel by @dogacel0 and Kernel Design Agents from @songhan_mit's lab, both winners at the MLSys Sparse Attention FlashInfer competition. The thread highlights different approaches using subagents and Claude skills for GPU programming.
The paper proposes GTBP, a graph-based back-propagation framework for context adaptation in multi-LLM agentic systems, which improves prompt optimization with theoretical convergence guarantees and outperforms existing methods on benchmarks.
The article debates whether future AI systems should use a unified agent stack or modular ensembles, and advocates for more realistic robustness benchmarks beyond static evaluations.
The author built an AI research tool that reduces hallucination through strict orchestration and harness engineering, enabling users to supervise research decisions and verify sources.
TimeRouter is an efficient routing framework for time-series foundation models that uses lightweight discriminative routing and selective gating to adaptively select the best expert model without LLM overhead, achieving state-of-the-art results on the GIFT-EVAL leaderboard.
Introduces RECAP, a benchmark for evaluating continual learning of prompts under evolving constraints in a proactive adaptation setting. Results show that existing prompt optimization methods fail in this setting, highlighting the need for new methods.
This paper introduces AARR (Act As a Real Researcher), a suite of benchmarks to evaluate frontier LLMs and agentic systems on granular research scenarios. The first benchmark, AARRI-Bench, reveals that even top-performing agents achieve only 68.3% success, highlighting gaps in field sensitivity and nuanced reasoning.
τ-Rec is a verifiable benchmark for agentic recommender systems that replaces subjective LLM-as-a-judge evaluations with verifiable rewards and controlled dialogue constraints, revealing steep reliability cliffs across leading models where even the best achieves only ~57% pass@1.
This paper proposes 'Trivium,' a framework that introduces long-horizon temporal regret and epistemic regret as first-class objectives alongside outcome regret for causal-memory controllers in agentic LLM systems. The authors prove that outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, while their approach achieves O(log E) temporal regret on CausalBench-Seq experiments versus linear growth for baselines.
Researchers introduce MedSP1000, a 1,638-case interactive benchmark derived from standardized patient scenarios to evaluate LLMs as dynamic clinical agents across multi-turn encounters. Results show even the best model (GPT-5.5) completes only 60.4% of expert rubric items, suggesting current LLMs are not yet reliable enough for clinical practice.
This paper proposes principled approaches for designing and optimizing practical agentic LLM systems, introducing a framework with pseudo-tools and fixed workflows to improve modularity, cost-efficiency, and accuracy across diverse tasks.
MAVEN is a lightweight symbolic reasoning scaffold that improves generalization in agentic tool calling by using modular verification and adaptive tool orchestration. It achieves significant accuracy gains on a new stress-test benchmark (MAVEN-Bench) and remains competitive with proprietary models at a fraction of the cost.
Peter Steinberger shares that he has secured his visa and is moving to San Francisco for MS Build and an OpenClaw after-hours event at GitHub HQ. The event includes fireside chats, panels, and demos from NVIDIA focused on agentic systems.