calibration

#calibration

Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study

arXiv cs.CL · 19h ago

This paper empirically investigates whether aligning the allocation cost with the output-space objective improves compressed model fidelity in ROCKET, a training-free LLM compression method. Results show a trade-off between accuracy and perplexity, with effects more pronounced at higher compression ratios.

#calibration

PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration

arXiv cs.LG · 19h ago

Introduces PEBS, a per-rater empirical-Bayes shrinkage estimator for calibrating reward models in RLHF, reducing within-user RMSE by over 8.5% on PRISM and over 9.6% on PluriHarms.

#calibration

@FinanceYF5: The Platonic representation hypothesis is mostly a statistical illusion. New research shows that the apparent 'global convergence' in scaled AI models is actually a mathematical artifact caused by selection bias in model width and depth. Once calibrated, global convergence disappears.

X AI KOLs Following · yesterday

New research indicates that the apparent 'global convergence' in scaled AI models is actually a statistical illusion caused by selection bias in model width and depth, and disappears once calibrated.

#calibration

We built a calibration-aware Q4_K_M quant of Qwen3.5 0.8B that recovers 96.5% of the BF16 gap vs pure llama.cpp Q4_K_M (SpectralQuant)

Reddit r/LocalLLaMA · 2d ago

A calibration-aware Q4_K_M quantization of Qwen3.5 0.8B using SpectralQuant recovers 96.5% of the BF16 performance gap compared to the standard llama.cpp Q4_K_M quant.

#calibration

@_akhaliq: paper:

X AI KOLs Following · 3d ago

This paper proposes Robust-TO, an agentic video understanding framework that integrates per-frame trustworthiness to address the Blind Trust Problem, achieving significant accuracy gains under realistic perturbations.

#calibration

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv cs.CL · 4d ago

This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-judges have significant calibration and adversarial robustness issues that undermine reported ASR numbers.

#calibration

Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration

arXiv cs.LG · 4d ago

This paper studies how pruning attention layers in LLMs affects explanation faithfulness and confidence calibration, finding that accuracy often remains high but interpretability and reliability degrade, highlighting a misalignment between model confidence, interpretability, and accuracy.

#calibration

Conformal Orbit-Valid Trust Horizons for Equivariant World Models

arXiv cs.LG · 4d ago

This paper proposes a method to certify the trust horizon of latent world models with known group symmetries by calibrating a raw error-propagation curve using split-conformal prediction and leveraging equivariance to transport certificates over the entire group orbit. The approach provides finite-sample guarantees and demonstrates non-vacuous certificates on symmetric 2D and 3D substrates.

#calibration

When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs

arXiv cs.LG · 5d ago

This paper investigates the effectiveness of top-1 collapse rate as a stability monitor for short-horizon LoRA fine-tuning of discrete diffusion language models, finding it has zero precision, and proposes max gradient norm as a more reliable alternative with higher precision and F1 score on LLaDA-family models.

#calibration

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

arXiv cs.CL · 5d ago

The paper introduces CALIBER, a method for calibrating confidence in reasoning language models by eliciting confidence estimates both before and after reasoning, with supervision targets matched to the information state. It reduces Expected Calibration Error by up to 52.5% and reports strong Brier scores and AUROC across multiple benchmarks.

#calibration

Calibrating 2-bit GGUFs (<10Gb) for agentic coding tasks

Reddit r/LocalLLaMA · 2026-06-18

This article introduces calibrated 2-bit GGUF quantizations of the Qwopus3.6-27B-Coder model for agentic coding tasks, demonstrating that the IQ2_M quant (9.74 GiB) achieves a 63% pass rate on the SWE-rebench benchmark, comparable to a Q5_K_M quant at half the size.

#calibration

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

arXiv cs.CL · 2026-06-17

This paper investigates verbalized methods for extracting LLM confidence in machine translation outputs, comparing them with internal token probabilities. The study finds that while both approaches perform similarly in error detection and calibration, there is little correlation between internal and verbalized confidence measures.

#calibration

False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control

arXiv cs.LG · 2026-06-16

This paper audits the reliability of distribution-free risk control methods for selective classification in signal-domain detectors, finding that naive thresholding often exceeds its declared budget and that exchangeability violations cause certificate failures.

#calibration

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

arXiv cs.AI · 2026-06-15

LLM agents often mis-assess their own performance after observing environment feedback, a problem called the reflection gap. RefGRPO addresses this by augmenting RL with a free calibration bonus and dynamic scheduling, reducing underconfidence from 44.4% to 7.7% and improving task accuracy on text-to-SQL benchmarks.

#calibration

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

arXiv cs.LG · 2026-06-15

This paper introduces SLC (State-space Logit Correction), which corrects per-item logit bias in knowledge tracing models using empirical-Bayes shrinkage via a Kalman smoother, improving AUC beyond global calibration techniques.

#calibration

Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

arXiv cs.LG · 2026-06-15

This paper introduces a non-parametric multi-view Gaussian process framework for detecting machine-generated text that is robust to adversarial manipulations like paraphrasing. By combining complementary features and providing calibrated uncertainty, it outperforms existing detectors on held-out attacks.

#calibration

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Hugging Face Daily Papers · 2026-06-15

TuneJury is an open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications.

#calibration

Strategic Decision Support for AI Agents

arXiv cs.AI · 2026-06-12

This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.

#calibration

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

arXiv cs.CL · 2026-06-11

This paper identifies Calibration Drift Under Reasoning (CDUR), where increasing chain-of-thought reasoning budgets causes LLMs to become systematically overconfident in incorrect answers, and proposes a Hypothesis Lock-In model and a calibration-aware stopping rule (CABStop) to mitigate the issue.

#calibration

Toward Calibrated, Fair, and Accurate Deepfake Detection

arXiv cs.LG · 2026-06-10

Introduces Face-Fairness (FF), a plug-and-play framework for bias mitigation in deepfake detection, featuring Face-Feature Tuning (FFT) as the first demographic label-free fairness method that improves group accuracy and reduces performance gaps across demographics.
