agentic-systems

#agentic-systems

@0xRicker: Anthropic Agents team just dropped an 11-page paper on "Loop Design: The Anthropic Playbook for Agentic Systems" Everyo…

X AI KOLs Timeline · 21h ago Cached

Anthropic released an 11-page paper titled 'Loop Design: The Anthropic Playbook for Agentic Systems', arguing that independent verifiers are more critical than prompts in agent design.


@0xMovez: A senior Google engineer just dropped a 19-page PDF on "Loop Engineering" for LLM and agentic systems. Act → Observe → …

X AI KOLs Timeline · yesterday Cached

A senior Google engineer released a 19-page PDF on 'Loop Engineering' for LLM and agentic systems, outlining an iterative feedback loop where the LLM proposes code transformations, observes compiler feedback, learns from it, and repeats until improvements stop.


PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

arXiv cs.AI · 2026-06-17 Cached

PseudoBench is a benchmark to evaluate whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Testing seven state-of-the-art agents reveals they readily produce persuasive pseudoscientific reports with near-zero refusal rates, calling for scientific alignment before deployment.


Offline Preference-Based Trajectory Evaluation

arXiv cs.LG · 2026-06-17 Cached

This paper proposes offline preference-based trajectory evaluation for agentic systems, which compares trajectories via temporal preferences rather than binary success metrics. It shows that this approach reduces ties from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.


Playful Agentic Robot Learning

Hugging Face Daily Papers · 2026-06-17 Cached

Introduces Playful Agentic Robot Learning, where embodied coding agents use self-directed play to learn reusable skills, improving downstream task performance without additional training. The proposed RATs system achieves significant gains over baselines in simulation and real-world transfer.


Towards End-to-End Automation of AI Research

arXiv cs.AI · 2026-06-16 Cached

A paper presenting The AI Scientist, a system that automates the entire research lifecycle from idea generation to peer review, demonstrating AI's growing capacity for scientific contribution.


@levidiamode: 163/365 of GPU Programming Looking at a few different agentic GPU kernel optimization systems today. The two I'm most i…

X AI KOLs Timeline · 2026-06-15 Cached

A tweet discussing two agentic GPU kernel optimization systems: Auto GPU Kernel by @dogacel0 and Kernel Design Agents from @songhan_mit's lab, both winners at the MLSys Sparse Attention FlashInfer competition. The thread highlights different approaches using subagents and Claude skills for GPU programming.


Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

arXiv cs.LG · 2026-06-15 Cached

The paper proposes GTBP, a graph-based back-propagation framework for context adaptation in multi-LLM agentic systems, which improves prompt optimization with theoretical convergence guarantees and outperforms existing methods on benchmarks.


As we scale toward agentic, multimodal systems combining LLMs, RLHF, tool-use, and retrieval-augmented generation, what practical architecture best balances reliability, alignment, and cost?

Reddit r/artificial · 2026-06-11

The post debates whether future AI systems should use a unified agent stack or modular ensembles, and advocates for more realistic robustness benchmarks beyond static evaluations.


AI system that lets you supervise and direct research

Reddit r/AI_Agents · 2026-06-11

The author built an AI research tool that reduces hallucination through strict orchestration and harness engineering, enabling users to supervise research decisions and verify sources.


TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

arXiv cs.LG · 2026-06-11 Cached

TimeRouter introduces an efficient routing framework for time-series foundation models that uses lightweight discriminative routing and selective gating to adaptively select the best expert model without LLM overhead, achieving state-of-the-art on the GIFT-EVAL leaderboard.


RECAP: Regression Evaluation for Continual Adaptation of Prompts

arXiv cs.LG · 2026-06-08 Cached

Introduces RECAP, a benchmark for evaluating continual learning of prompts under evolving constraints in a proactive adaptation setting. Results show that existing prompt optimization methods fail in this setting, highlighting the need for new methods.


Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

arXiv cs.AI · 2026-06-08 Cached

This paper introduces AARR (Act As a Real Researcher), a suite of benchmarks to evaluate frontier LLMs and agentic systems on granular research scenarios. The first benchmark, AARRI-Bench, reveals that even top-performing agents achieve only 68.3% success, highlighting gaps in field sensitivity and nuanced reasoning.


τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Hugging Face Daily Papers · 2026-06-08 Cached

τ-Rec is a verifiable benchmark for agentic recommender systems that replaces subjective LLM-as-a-judge evaluations with verifiable rewards and controlled dialogue constraints, revealing steep reliability cliffs across leading models where even the best achieves only ~57% pass@1.


Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

arXiv cs.AI · 2026-06-04 Cached

This paper proposes 'Trivium,' a framework that introduces long-horizon temporal regret and epistemic regret as first-class objectives alongside outcome regret for causal-memory controllers in agentic LLM systems. The authors prove that outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, while their approach achieves O(log E) temporal regret on CausalBench-Seq experiments versus linear growth for baselines.


Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Hugging Face Daily Papers · 2026-06-03

Researchers introduce MedSP1000, a 1,638-case interactive benchmark derived from standardized patient scenarios to evaluate LLMs as dynamic clinical agents across multi-turn encounters. Results show even the best model (GPT-5.5) completes only 60.4% of expert rubric items, suggesting current LLMs are not yet reliable enough for clinical practice.


Learning to Construct Practical Agentic Systems

arXiv cs.LG · 2026-06-02 Cached

This paper proposes principled approaches for designing and optimizing practical agentic LLM systems, introducing a framework with pseudo-tools and fixed workflows to improve modularity, cost-efficiency, and accuracy across diverse tasks.


MAVEN: Improving Generalization in Agentic Tool Calling

arXiv cs.AI · 2026-06-01 Cached

MAVEN is a lightweight symbolic reasoning scaffold that improves generalization in agentic tool calling by using modular verification and adaptive tool orchestration. It achieves significant accuracy gains on a new stress-test benchmark (MAVEN-Bench) and remains competitive with proprietary models at a fraction of the cost.


@steipete: Finally got my visa sorted out and moving to San Francisco, just in time for MS Build and OpenClaw’s after hours!

X AI KOLs Timeline · 2026-05-31 Cached

Peter Steinberger shares he has secured his visa and is moving to San Francisco for MS Build and an OpenClaw after-hours event at GitHub HQ, which includes fireside chats, panels, and demos from NVIDIA focused on agentic systems.


Blaming the dev is the wrong frame when the review layer doesn't exist.

Reddit r/AI_Agents · 2026-05-31

A discussion about an incident where an AI coding agent deleted a production database, arguing that blaming the developer is misplaced when proper review processes are not in place.
