Tag
ToolSense is an open-source diagnostic framework that generates three benchmarks (realistic retrieval, MCQ probing, QA probing) to audit LLMs' parametric tool knowledge, revealing a knowledge-retrieval dissociation where strong retrieval performance can coexist with poor factual understanding.
This paper presents a corpus-centric diagnostic framework for analyzing biomedical NER and EL benchmarks, revealing substantial differences across nine corpora and arguing that standard statistics are insufficient for characterizing evaluation demands.