Tag
This paper identifies harmful continuations in answer-correct long chain-of-thought training traces for LLM SFT, characterized by uncertainty-geometry mismatches, and proposes a lightweight boundary proxy method to remove them.
The paper introduces SpatialUncertain, a benchmark to evaluate whether vision-language models recognize when they cannot answer spatial questions due to occlusion or perspective ambiguity, revealing overconfidence and poor abstention behavior.
This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning, using structured domain representations to repair uncertain procedural steps.
Researchers used an IBM quantum computer to reduce uncertainty in a pretrained large language model, the first demonstration of quantum enhancement in such a model, enabling it to answer questions correctly where the base model failed.
A new Google paper argues that LLMs should focus on expressing uncertainty honestly rather than aiming for perfect factuality, proposing 'faithful uncertainty' to build trust.
This paper proposes a family of metrics called ECUAS_n for principled evaluation of uncertainty-augmented systems that output both predictions and uncertainty scores. The authors argue that existing evaluation approaches are inadequate and formulate these metrics as proper scoring rules for decision-making under uncertainty.
The paper introduces the Bayesian Filtering Transformer (BFT), which incorporates uncertainty into Transformers via precision-weighted attention and Kalman update residuals, improving performance on sequential recommendation and noisy LLM fine-tuning.
This paper demonstrates that volatility and stochasticity, both sources of uncertainty, drive optimal exploration in opposite directions: volatility increases exploration while stochasticity suppresses it. The authors extend the Gittins index framework to Gaussian state-space bandits and introduce CAUSE, a closed-form exploration bonus that outperforms standard strategies.
The article argues that AI in medicine may fail not for lack of eloquence but because of poor calibration and an inability to express uncertainty, and calls for features that build trust.
This paper evaluates six open-weight LLMs on biomedical QA under conflicting evidence conditions, revealing accuracy drops and prediction flips, and proposes a conflict-aware abstention score that improves selective accuracy.
Senior developers often fail to communicate effectively with business teams because they overemphasize code complexity, whereas what business teams actually care about is eliminating uncertainty. The article suggests that developers ask "Can we try a faster approach?" to align both sides, and points out that although AI can write code quickly, humans still bear the responsibility for it.
TwinTrack is a post-hoc calibration framework for pancreatic cancer segmentation that aligns ensemble model probabilities with the empirical mean human response across multiple annotators, improving interpretability and calibration metrics on multi-rater benchmarks.
OpenAI publishes research explaining that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty, and proposes that evaluation metrics should prioritize honesty about limitations over raw accuracy.