Tag
ToolFailBench, a diagnostic benchmark for tool-using agents, has been accepted at two ICML 2026 workshops, FAGEN and AIWILD.
A tweet observing that much AI cognition will be adequate for tasks, with remaining work involving diagnostic triage such as deciding whether to spend on a lawyer.
This paper investigates a harmful phenomenon in long chain-of-thought (CoT) training traces where post-conclusion continuation reduces training utility, and proposes a diagnostic method called HarmfulContinuationCut (HCC) to detect such harmful continuations.
This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes like reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, achieving dramatic success rate improvements (e.g., DoorKey-8×8 from 2.3% to 97.6%) compared to one-shot generation.
Introduces SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory that measures multiple dimensions beyond aggregate metrics, revealing trade-offs between adaptability and stability.