diagnostic

#diagnostic

When Does Machine Learning Beat Value Sorting? A Three-Dataset Diagnostic of Exposure-Weighted Shipment Prioritization

arXiv cs.AI · yesterday

This paper investigates when machine learning outperforms traditional value sorting for exposure-weighted shipment prioritization using three datasets.

What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It

arXiv cs.CL · 2026-07-02

This paper introduces answer-in-context, a diagnostic metric for budget-constrained multi-hop RAG that measures whether the gold answer survives in the packed reader context, and proposes a submodular evidence packing method that improves over heuristics under specific conditions.
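As a rough illustration of the idea in this summary, the sketch below checks whether a gold answer string is still present after passages are packed under a token budget. The summary only states what the metric measures; the case-insensitive substring match, the whitespace token count, and the names `pack_greedy` and `answer_in_context` are assumptions made here for illustration. The greedy packer is a stand-in heuristic, not the paper's submodular method.

```python
def pack_greedy(passages, budget):
    """Keep passages in ranked order while their total token count fits the budget.

    Tokens are approximated by whitespace splitting (an assumption).
    """
    packed, used = [], 0
    for text in passages:
        cost = len(text.split())
        if used + cost <= budget:
            packed.append(text)
            used += cost
    return packed


def answer_in_context(packed, gold_answer):
    """Return True if the gold answer appears verbatim in the packed context.

    Case-insensitive substring matching is an assumed proxy for "survives".
    """
    context = " ".join(packed).lower()
    return gold_answer.lower() in context


passages = [
    "Paris is the capital of France.",
    "The Eiffel Tower opened in 1889.",
    "France borders Spain.",
]
tight = pack_greedy(passages, budget=8)    # only the first passage fits
loose = pack_greedy(passages, budget=12)   # the first two passages fit
print(answer_in_context(tight, "1889"), answer_in_context(loose, "1889"))
```

In this toy case the answer is lost under the tighter budget and retained under the looser one, which is the kind of failure such a metric is meant to surface before the reader model is ever called.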

When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence

arXiv cs.LG · 2026-06-26

This paper proposes a leakage-safe diagnostic that tests whether quality-aware multimodal fusion methods actually use reliability scores during inference, by permuting those scores across test examples. In experiments on StressID and CMU-MOSEI, shuffling the reliability scores leaves performance unchanged, indicating that quality signals influence decisions only when they reliably predict unimodal correctness.
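The permutation test described in this summary can be sketched on a toy two-modality classifier. Everything below is hypothetical and for illustration only: the reliability-weighted averaging in `fuse`, the 0.5 decision threshold, and the names `accuracy` and `permutation_gap` are assumptions, not the paper's models or code. The only part taken from the summary is the core move of shuffling reliability scores across test examples and comparing performance.

```python
import random


def fuse(p_a, p_b, r_a, r_b):
    """Reliability-weighted average of two unimodal probabilities (assumed fusion rule)."""
    total = r_a + r_b
    if total > 0:
        return (r_a * p_a + r_b * p_b) / total
    return 0.5 * (p_a + p_b)


def accuracy(examples, reliabilities):
    """Accuracy of the fused prediction; each example is (p_a, p_b, label)."""
    correct = 0
    for (p_a, p_b, label), (r_a, r_b) in zip(examples, reliabilities):
        pred = 1 if fuse(p_a, p_b, r_a, r_b) >= 0.5 else 0
        correct += int(pred == label)
    return correct / len(examples)


def permutation_gap(examples, reliabilities, seed=0):
    """Accuracy with the true reliability scores minus accuracy with shuffled ones.

    Only test-time scores are permuted; no labels are used to choose the shuffle.
    """
    rng = random.Random(seed)
    shuffled = list(reliabilities)
    rng.shuffle(shuffled)
    return accuracy(examples, reliabilities) - accuracy(examples, shuffled)


examples = [(0.9, 0.2, 1), (0.1, 0.8, 0), (0.7, 0.6, 1)]
uniform = [(1.0, 1.0)] * 3
print(permutation_gap(examples, uniform))
```

A gap near zero suggests the decisions do not depend on which example each reliability score came from; in practice one would average the gap over many seeds rather than rely on a single shuffle.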

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

arXiv cs.CL · 2026-06-15

This paper introduces MoDiCoL, a modular diagnostic continual learning dataset for robust speech recognition that enables controlled analysis of linguistic content, speaker characteristics, and acoustic environments. It also proposes a continual learning curriculum for studying how robustness is acquired, transferred, and forgotten.

@SoHarshhh: Really happy to share that “ToolFailBench” got accepted at two ICML 2026 workshops, FAGEN and AIWILD. Most benchmarks e…

X AI KOLs Following · 2026-06-01

ToolFailBench, a diagnostic benchmark for tool-using agents, has been accepted at two ICML 2026 workshops, FAGEN and AIWILD.

@patio11: Much of that cognition is going to be entirely adequate for the task at hand. Some of the rest will be the important di…

X AI KOLs Following · 2026-05-31

A tweet observing that much AI cognition will be adequate for the task at hand, with part of the remaining work being diagnostic triage, such as deciding whether to spend on a lawyer.

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

arXiv cs.AI · 2026-05-29

This paper investigates long chain-of-thought (CoT) training traces in which continuing past the conclusion reduces training utility, and proposes a diagnostic method, HarmfulContinuationCut (HCC), to detect such harmful continuations.

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

arXiv cs.LG · 2026-05-29

This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem and identifies failure modes such as reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, which raises success rates sharply over one-shot generation (e.g., DoorKey-8×8 from 2.3% to 97.6%).

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

arXiv cs.LG · 2026-05-18

This paper introduces SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory that measures multiple dimensions beyond aggregate metrics, revealing trade-offs between adaptability and stability.
