Tag
R-APS (Reflective Adversarial Pareto Search) is a novel method for constrained design tasks that addresses three structural failures in LLM-based agentic systems—error propagation, robustness evaluation, and knowledge invalidation—through reasoning-mode decomposition across three timescales, requiring no fine-tuning. Evaluated on planar mechanism synthesis, it achieves 3.5x tighter robustness certificates, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over baselines.
MAVEN is a lightweight symbolic reasoning scaffold that improves generalization in agentic tool calling by using modular verification and adaptive tool orchestration. It achieves significant accuracy gains on a new stress-test benchmark (MAVEN-Bench) and remains competitive with proprietary models at a fraction of the cost.
This paper introduces 'composition collapse', a phenomenon where language models with stable factual knowledge still fail to compose that knowledge into correct multi-hop reasoning, and proposes a double-gate protocol to isolate composition failure from atomic knowledge instability.
This research paper investigates how shortcut solutions learned by Transformer models, specifically BERT, impair their ability to perform continual compositional reasoning. It contrasts BERT with ALBERT, finding that ALBERT's recurrent nature offers a better inductive bias for continual learning tasks.
The Amazing Agent Race (AAR) is a new benchmark of 1,400 directed acyclic graph (DAG) puzzle instances that evaluates LLM agents on fork-merge tool chains and Wikipedia navigation. Evaluations reveal that agents excel at tool use (errors <17%) but struggle with navigation (27-52% of failures), exposing a critical gap invisible to existing linear benchmarks.
This paper proposes Slipform, a training framework that uses lexical concreteness to select harder negatives and a margin-based Cement loss, boosting compositional reasoning in vision-language models.