calibration

#calibration

Calibrating 2-bit GGUFs (<10Gb) for agentic coding tasks

Reddit r/LocalLLaMA · 5d ago

This post introduces calibrated 2-bit GGUF quantizations of the Qwopus3.6-27B-Coder model for agentic coding tasks. It reports that the IQ2_M quant (9.74 GiB) achieves a 63% pass rate on the SWE-rebench benchmark, comparable to a Q5_K_M quant twice its size.

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

arXiv cs.CL · 2026-06-17

This paper investigates verbalized methods for extracting LLM confidence in machine translation outputs, comparing them with internal token probabilities. The study finds that while both approaches perform similarly in error detection and calibration, there is little correlation between internal and verbalized confidence measures.

False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control

arXiv cs.LG · 2026-06-16

This paper audits the reliability of distribution-free risk control methods for selective classification in signal-domain detectors, finding that naive thresholding often exceeds its declared budget and that exchangeability violations cause certificate failures.

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

arXiv cs.AI · 2026-06-15

LLM agents often mis-assess their own performance after observing environment feedback, a problem called the reflection gap. RefGRPO addresses this by augmenting RL with a free calibration bonus and dynamic scheduling, reducing underconfidence from 44.4% to 7.7% and improving task accuracy on text-to-SQL benchmarks.

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

arXiv cs.LG · 2026-06-15

This paper introduces SLC (State-space Logit Correction), which corrects per-item logit bias in knowledge tracing models using empirical-Bayes shrinkage via a Kalman smoother, improving AUC beyond global calibration techniques.

Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

arXiv cs.LG · 2026-06-15

This paper introduces a non-parametric multi-view Gaussian process framework for detecting machine-generated text that is robust to adversarial manipulations like paraphrasing. By combining complementary features and providing calibrated uncertainty, it outperforms existing detectors on held-out attacks.

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Hugging Face Daily Papers · 2026-06-15

TuneJury is an open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications.

Strategic Decision Support for AI Agents

arXiv cs.AI · 2026-06-12

This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

arXiv cs.CL · 2026-06-11

This paper identifies Calibration Drift Under Reasoning (CDUR), where increasing chain-of-thought reasoning budgets causes LLMs to become systematically overconfident in incorrect answers, and proposes a Hypothesis Lock-In model and a calibration-aware stopping rule (CABStop) to mitigate the issue.

Toward Calibrated, Fair, and Accurate Deepfake Detection

arXiv cs.LG · 2026-06-10

This paper introduces Face-Fairness (FF), a plug-and-play framework for bias mitigation in deepfake detection. Its core component, Face-Feature Tuning (FFT), is presented as the first fairness method that needs no demographic labels; it improves group accuracy and narrows performance gaps across demographics.

Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

arXiv cs.LG · 2026-06-10

The paper introduces Probe-Conditioned Head Intervention (PCHI), an inference-time method for LLMs that selectively reduces overconfidence on wrong answers without significantly reducing confidence on correct ones, by conditionally rescaling attention head outputs when the model is likely wrong but confident.

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

arXiv cs.CL · 2026-06-10

The paper proposes TRACE, a method for machine unlearning in Mixture-of-Experts language models that calibrates retain regularization by reweighting token-level retain losses to address forget-retain routing mismatch. Experiments show improved forget-utility trade-off across multiple MoE LLMs.

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

arXiv cs.CL · 2026-06-10

This paper introduces Program-based Posterior Training (PPT), a method that uses LLM-generated probabilistic programs to create distributional targets for fine-tuning inductive reasoning, improving estimation accuracy and calibration on held-out tasks and human-alignment benchmarks.

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

arXiv cs.LG · 2026-06-08

This paper proposes FAIR-Calib, a two-stage post-training quantization framework for diffusion large language models that addresses the instability of token commitments during iterative refinement. It achieves state-of-the-art results on LLaDA and Dream models under low-bit quantization.

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

Hugging Face Daily Papers · 2026-06-08

TRIAGE is a framework that trains LLMs to generate dialectical reasoning for continuous risk scoring from irregularly sampled medical time series, achieving improved calibration and interpretability.

@HuggingPapers: NVIDIA just released the Anchor Lab dataset on Hugging Face Real-world robotics measurements to calibrate simulation ag…

X AI KOLs Following · 2026-06-05

NVIDIA released the Anchor Lab dataset on Hugging Face, containing real-world robotics measurements for calibrating simulation to enable zero-shot sim-to-real deployment.

The best AI “science critics” are also the most overconfident — a benchmark on calibration vs. skill

Reddit r/artificial · 2026-06-05

The article introduces the Refute benchmark, which tests LLMs on critiquing science paper summaries and measures their calibration. Results show that the best critic models are often the most overconfident when wrong.

Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice [D]

Reddit r/MachineLearning · 2026-06-04

A practitioner discusses the calibration vs. utility tradeoff in LLM agents, sharing experience with a verifier-based pipeline that reduces hallucinated tool calls by ~60% but introduces latency costs and drops easy correct answers.

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

arXiv cs.LG · 2026-06-03

This paper introduces a plug-in calibration module that adjusts multimodal representations before fusion, using cross-modal context to suppress misleading signals and emphasize reliable ones, improving performance on multiple benchmarks.

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

arXiv cs.CL · 2026-06-03

This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.
