This article introduces calibrated 2-bit GGUF quantizations of the Qwopus3.6-27B-Coder model for agentic coding tasks, demonstrating that the IQ2_M quant (9.74 GiB) achieves a 63% pass rate on the SWE-rebench benchmark, comparable to a Q5_K_M quant at half the size.
This paper investigates verbalized methods for extracting LLM confidence in machine translation outputs, comparing them with internal token probabilities. The study finds that while both approaches perform similarly in error detection and calibration, there is little correlation between internal and verbalized confidence measures.
This paper audits the reliability of distribution-free risk control methods for selective classification in signal-domain detectors, finding that naive thresholding often exceeds its declared budget and that exchangeability violations cause certificate failures.
LLM agents often mis-assess their own performance after observing environment feedback, a problem called the reflection gap. RefGRPO addresses this by augmenting RL with a free calibration bonus and dynamic scheduling, reducing underconfidence from 44.4% to 7.7% and improving task accuracy on text-to-SQL benchmarks.
This paper introduces SLC (State-space Logit Correction), which corrects per-item logit bias in knowledge tracing models using empirical-Bayes shrinkage via a Kalman smoother, improving AUC beyond global calibration techniques.
This paper introduces a non-parametric multi-view Gaussian process framework for detecting machine-generated text that is robust to adversarial manipulations like paraphrasing. By combining complementary features and providing calibrated uncertainty, it outperforms existing detectors on held-out attacks.
TuneJury is an open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications.
This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.
This paper identifies Calibration Drift Under Reasoning (CDUR), where increasing chain-of-thought reasoning budgets causes LLMs to become systematically overconfident in incorrect answers, and proposes a Hypothesis Lock-In model and a calibration-aware stopping rule (CABStop) to mitigate the issue.
This paper introduces Face-Fairness (FF), a plug-and-play framework for bias mitigation in deepfake detection. Its Face-Feature Tuning (FFT) component is presented as the first fairness method that requires no demographic labels, improving group accuracy and reducing performance gaps across demographics.
The paper introduces Probe-Conditioned Head Intervention (PCHI), an inference-time method that selectively reduces LLM overconfidence on wrong answers without significantly reducing confidence on correct ones. It conditionally rescales attention head outputs when the model is likely wrong but confident.
The paper proposes TRACE, a method for machine unlearning in Mixture-of-Experts language models that calibrates retain regularization by reweighting token-level retain losses to address forget-retain routing mismatch. Experiments show improved forget-utility trade-off across multiple MoE LLMs.
This paper introduces Program-based Posterior Training (PPT), a method that uses LLM-generated probabilistic programs to create distributional targets for fine-tuning inductive reasoning, improving estimation accuracy and calibration on held-out tasks and human-alignment benchmarks.
This paper proposes FAIR-Calib, a two-stage post-training quantization framework for diffusion large language models that addresses the instability of token commitments during iterative refinement. It achieves state-of-the-art results on LLaDA and Dream models under low-bit quantization.
TRIAGE is a framework that trains LLMs to generate dialectical reasoning for continuous risk scoring from irregularly sampled medical time series, achieving improved calibration and interpretability.
NVIDIA released the Anchor Lab dataset on Hugging Face, containing real-world robotics measurements for calibrating simulation to enable zero-shot sim-to-real deployment.
The article introduces the Refute benchmark, which tests LLMs on critiquing science paper summaries and measures their calibration. Results show that the best critic models are often the most overconfident when wrong.
A practitioner discusses the calibration vs. utility tradeoff in LLM agents, sharing experience with a verifier-based pipeline that reduces hallucinated tool calls by ~60% but introduces latency costs and drops easy correct answers.
This paper introduces a plug-in calibration module that adjusts multimodal representations before fusion, using cross-modal context to suppress misleading signals and emphasize reliable ones, improving performance on multiple benchmarks.
This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.
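Many of the entries above report calibration quality or overconfidence. As a rough illustration of how such a quantity can be measured, the sketch below computes expected calibration error (ECE) with equal-width confidence bins. The choice of ECE, the function name `expected_calibration_error`, and the `n_bins` parameter are assumptions made for illustration; none of the listed works is confirmed to use this exact formulation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: weighted mean gap between confidence and accuracy.

    confidences: predicted probabilities in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width confidence bins (hypothetical default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group predictions into equal-width bins by confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Each bin contributes its gap, weighted by its share of predictions.
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A score near zero means stated confidence tracks observed accuracy. An overconfident model, as several summaries describe, shows mean confidence above accuracy in its high-confidence bins.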