Tag
This paper analyzes the confidence calibration of 11 popular LLMs, finding that they are generally overconfident, particularly on hard tasks, but underconfident on easy ones. It introduces LifeEval, a test for evaluating calibration across difficulty levels.
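As a rough illustration of what "overconfident" means in this context, calibration is often summarized with expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The sketch below is a generic ECE computation, not the paper's LifeEval procedure; the function name, the default bin count, and the input layout are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Hypothetical helper for illustration, not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only two right has a gap of about 0.4 in that bin; a bin where mean confidence exceeds accuracy indicates overconfidence, and the reverse indicates underconfidence.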
This paper empirically tests the psychometric reliability of LLM-based user state classification, finding that only 31 of 213 metrics met reliability criteria, a result that calls into question trust in real-time adaptive systems.
PRISM is a closed-loop framework that treats prompt engineering as a continuous reliability problem for enterprise conversational AI. It automates test generation, simulation, evaluation, and repair, achieving 99% reliability and reducing authoring time from days to minutes.
This paper surveys the capabilities and limitations of AI across the full research lifecycle, from idea generation to dissemination, identifying a sharp boundary between reliable assistance and unreliable autonomy. It provides a taxonomy, benchmark suite, tool inventory, and design principles for human-governed AI collaboration in research.
This paper introduces AgentForesight, a framework for online auditing and early failure prediction in LLM-based multi-agent systems. It presents a new dataset, AFTraj-22K, and a specialized model, AgentForesight-7B, which outperforms leading proprietary models in detecting decisive errors during trajectory execution.
A former AI advocate details disillusionment with large language models, citing reliability issues, regression between versions, broken enterprise workflows, and lack of accountability in AI systems deployed across critical industries.
A user discusses frustrations with the reliability and consistency of free AI models when used as educational tutors, questioning whether paid versions offer significantly better performance for learning technical concepts.
Researchers introduce SHADE, a hybrid estimator that combines Good-Turing coverage with graph-spectral cues to quantify semantic uncertainty and detect LLM hallucinations when only a few black-box samples are available.
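For readers unfamiliar with the Good-Turing component, a minimal sketch follows. The classical Good-Turing estimate of "missing mass" (the probability that the next sample belongs to a class not yet seen) is the number of classes observed exactly once divided by the number of samples. This covers only that one ingredient, not SHADE's hybrid estimator or its graph-spectral cues, and it assumes the sampled answers have already been grouped into semantic clusters by some upstream step; the function name and label format are hypothetical.

```python
from collections import Counter


def good_turing_missing_mass(cluster_labels):
    """Estimate the probability mass of unseen semantic clusters.

    cluster_labels: one label per sampled answer (the clustering itself
    is assumed to be done elsewhere). Returns N1 / N, where N1 is the
    number of clusters seen exactly once and N is the sample count.
    """
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(cluster_labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n
```

With five samples falling into clusters `a, a, a, b, c`, two clusters are singletons, so the estimated missing mass is 2/5; a high value suggests the few samples drawn have not yet covered the space of answers the model might give.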
This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. The method demonstrates stronger robustness to calibration-deployment mismatch compared to text-level baselines across QA benchmarks.
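The general mechanism can be sketched with standard split conformal prediction: a threshold is taken from nonconformity scores on a held-out calibration set, and at test time every candidate whose score falls at or below that threshold is kept in the prediction set. The sketch below treats the scores as plain numbers supplied by the caller; it does not reproduce the paper's Layer-Wise Information scores, and the function names and candidate format are assumptions.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold at miscoverage level alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If that rank exceeds n, no finite threshold gives the guarantee,
    so infinity is returned (every candidate is kept).
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep candidates whose nonconformity score is within the threshold.

    candidate_scores: dict mapping candidate answer -> nonconformity score.
    """
    return [label for label, score in candidate_scores.items()
            if score <= threshold]
```

Smaller prediction sets at the same coverage level correspond to better efficiency, which is the trade-off the summary refers to; how well the guarantee holds under calibration-deployment mismatch depends on the choice of score.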
A user documented a sequence in which Gemini detected a real $280M KelpDAO/AAVE crypto exploit mid-conversation, retracted it as a hallucination under user skepticism, then reconfirmed it once mainstream coverage caught up. The episode illustrates how overcorrecting against hallucination can lead models to retract accurate information.
MIT researchers developed a new method for identifying overconfident LLMs by measuring cross-model disagreement across similar models, rather than relying solely on self-consistency metrics. This approach better captures epistemic uncertainty and more accurately identifies unreliable predictions in high-stakes applications.
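The summary does not give the method's details, so the following is only a simple stand-in for the idea of cross-model disagreement: ask several similar models the same question and measure the fraction of model pairs whose answers differ. The function name is hypothetical, and comparing answers by lowercased exact match is an assumption; a realistic system would need some form of semantic comparison.

```python
from itertools import combinations


def pairwise_disagreement(answers):
    """Fraction of model pairs that gave different answers to one prompt.

    answers: one answer string per model. Answers are compared after
    stripping whitespace and lowercasing (a crude, assumed equivalence).
    Returns 0.0 when all models agree and 1.0 when no two agree.
    """
    if len(answers) < 2:
        raise ValueError("need answers from at least two models")
    normalized = [a.strip().lower() for a in answers]
    pairs = list(combinations(normalized, 2))
    differing = sum(1 for a, b in pairs if a != b)
    return differing / len(pairs)
```

Unlike self-consistency, which resamples a single model, a score of this kind can flag a prediction as unreliable even when one model repeats the same wrong answer with high confidence, provided the other models disagree with it.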