Tag
This paper studies how pruning attention layers in LLMs affects explanation faithfulness and confidence calibration, finding that accuracy often remains high but interpretability and reliability degrade, highlighting a misalignment between model confidence, interpretability, and accuracy.
This paper proposes a cycle-consistent neural architecture that generates faithful natural language explanations of formal verification certificates, achieving 90% soundness and 860x faster inference than LLM baselines.
This paper introduces CAMS, a modular multi-document summarization framework that extracts atomic claims with token-level provenance, clusters equivalent claims, and rewrites them into summaries with fine-grained, multi-source traceability, significantly improving faithfulness and citation precision.
This paper introduces a trace-level diagnostic for evaluating chain-of-thought reasoning, separating susceptibility (whether bias changes the answer) from acknowledgment (whether the trace flags the biased input). Experiments show models like GPT-4o and Claude Sonnet 4 have similar susceptibility rates but very different acknowledgment rates, highlighting a blind spot in accuracy-only evaluation.
This paper extends optimal transport-based hallucination detection to all decoder layers in NMT and abstractive summarization, finding that the detection signal is concentrated in early layers and that the geometric signal transfers poorly to summarization, where faithfulness failures are not detectable via attention concentration.
This paper proposes Detect–Remask–Repair, a diffusion-based framework for localized faithfulness repair in summarization when contexts evolve, and introduces the StreamSum benchmark for evaluating such settings. Experiments show it offers controllable trade-offs between faithfulness, speed, and content preservation.
LatticeBridge proposes a twisted sequential Monte Carlo decoder for structured sequence generation that improves constraint satisfaction by treating the problem as rare-event inference, outperforming greedy and beam baselines on CommonGen, E2E NLG, and WikiBio.
This paper introduces FullCite, a framework for generating structured inline citations that link each claim to both its source document and specific evidence spans. Evaluated on three QA benchmarks (ASQA, BioASQ, ExpertQA), it finds that while LLMs are good at document-level attribution, they struggle with precise evidence span identification.
This paper introduces compatibility and incompatibility scores for evaluating collections of bivariate causal statements without relying on faithfulness, and demonstrates their applicability by analyzing causal claims from large language models.
OCC-RAG introduces a family of compact small language models optimized for faithful question answering, using a novel pipeline to synthesize multi-context multi-hop QA data. The models demonstrate competitive performance against larger models on reasoning and faithfulness benchmarks.
This paper identifies a novel failure mode in reasoning models called unfaithful capitulation, where the chain-of-thought remains factually correct across adversarial multi-turn dialogues but the final answer flips to an incorrect one, highlighting limitations of current evaluation methods.
The paper proposes the Unlearning Depth Score (UDS), a metric that uses activation patching to quantify how thoroughly target knowledge is erased from LLMs, achieving state-of-the-art faithfulness and robustness across multiple unlearning methods.
This paper proposes a framework to evaluate and improve faithfulness of chain-of-thought reasoning by controlling information flow, using entropy-based, KL-divergence, and gradient-based diagnostics, and introduces training interventions (attention masking, gradient masking, adversarial perturbations) that make reasoning more transparent and reduce shortcut reliance.
This paper introduces BonaFide, a benchmark of 3,066 labeled chain-of-thought examples across 13 tasks and 10 models, and systematically evaluates faithfulness metrics, showing that most perform near chance and have significant limitations in reliability and efficiency.
Faithful-MR1 is a training framework that improves faithful multimodal reasoning in MLLMs by anchoring visual attention via a <Focus> token and reinforcing faithful use through counterfactual image intervention. It outperforms baselines on Qwen2.5-VL backbones with less training data.
This paper proposes an adversarial Sobolev alignment method for faithful image super-resolution, aiming to reduce artifacts and improve fidelity.
This article discusses the importance of faithfulness in LLM optimization, introducing a Structural Fidelity Score that measures drift across word overlap, constraint survival, and task-type match to ensure prompt optimization does not sacrifice intent.
This paper proposes Retrieval-Augmented Linguistic Calibration (RALC), a post-hoc pipeline for calibrating confidence signals in LLMs by modeling linguistic confidence as a distribution and using retrieval-augmented rewriting. It introduces a Faithfulness Divergence metric and shows significant improvements across benchmarks.
This paper investigates the trade-off between plausibility and faithfulness in cross-lingual explanations from LLMs, finding that English-pivot explanations achieve higher span agreement with human rationales but suffer reduced causal faithfulness compared to native-language explanations.
This paper empirically examines the trade-off between fluency and faithfulness in literary translation using 130,486 paragraphs from 106 novels, finding a consistent negative correlation for human and Google Translate translations, but a weaker one for TranslateGemma.