diagnostic

@SoHarshhh: Really happy to share that “ToolFailBench” got accepted at two ICML 2026 workshops, FAGEN and AIWILD. Most benchmarks e…

X AI KOLs Following · 3d ago

ToolFailBench, a diagnostic benchmark for tool-using agents, has been accepted at two ICML 2026 workshops, FAGEN and AIWILD.

@patio11: Much of that cognition is going to be entirely adequate for the task at hand. Some of the rest will be the important di…

X AI KOLs Following · 3d ago

A tweet observing that much AI cognition will be adequate for the task at hand, while some of the remaining work is diagnostic triage, such as deciding whether a matter is worth paying a lawyer for.

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

arXiv cs.AI · 6d ago

This paper investigates long chain-of-thought (CoT) training traces in which continuing past the conclusion reduces training utility, even when the final answer is correct. It proposes a diagnostic method, HarmfulContinuationCut (HCC), to detect such harmful continuations.

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

arXiv cs.LG · 6d ago

This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem and identifies failure modes such as reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, which raises success rates well above one-shot generation (e.g., DoorKey-8×8 from 2.3% to 97.6%).

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

arXiv cs.LG · 2026-05-18

This paper introduces SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. It measures multiple dimensions rather than a single aggregate score, revealing trade-offs between adaptability and stability.
