Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics
Summary
This paper introduces a method for improving factual consistency in text summarisation. It aggregates scores from multiple weak factuality metrics into preferences and trains on them, yielding consistent factuality gains across models ranging from early encoder-decoder architectures to modern large language models.
# Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

Source: [https://arxiv.org/abs/2605.26840](https://arxiv.org/abs/2605.26840) [View PDF](https://arxiv.org/pdf/2605.26840)

> Abstract: Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model behaviour. While individual factuality metrics are unreliable, their combination can more effectively capture diverse factual errors. We leverage this insight to introduce an automated training pipeline that improves factual consistency in summaries by aggregating scores from different weak metrics. Our approach avoids the need for complex reward shaping by mapping scores to preferences and filtering out cases with high disagreement between metrics. For each source document, we generate lexically similar summary pairs by varying decoding strategies, enabling the model to learn from factual differences caused by subtle lexical differences. This approach constructs a high-quality preference dataset using only source documents. Experiments demonstrate consistent factuality gains across models, ranging from early encoder-decoder architectures to modern large language models, with smaller models reaching comparable factuality to larger ones.

## Submission history

From: Yuxuan Ye

**[v1]** Tue, 26 May 2026 10:55:03 UTC (21,531 KB)
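The abstract describes mapping metric scores to preferences and discarding pairs on which the metrics disagree strongly. The following is a minimal sketch of that idea, not the authors' implementation. The three toy metrics, the majority-vote rule, and the `min_agreement` threshold are all assumptions made for illustration; the abstract does not specify which metrics are used or how disagreement is measured.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A metric maps (source, summary) to a score; higher means "more consistent".
Metric = Callable[[str, str], float]


@dataclass
class PreferencePair:
    source: str
    chosen: str
    rejected: str


def _precision(items, allowed) -> float:
    """Fraction of items found in `allowed`; 1.0 when there is nothing to check."""
    items = list(items)
    if not items:
        return 1.0
    return sum(item in allowed for item in items) / len(items)


# Toy stand-ins for weak factuality metrics (hypothetical, illustration only).
def token_precision(source: str, summary: str) -> float:
    src = set(re.findall(r"\w+", source.lower()))
    return _precision(re.findall(r"\w+", summary.lower()), src)


def number_precision(source: str, summary: str) -> float:
    return _precision(re.findall(r"\d+", summary), set(re.findall(r"\d+", source)))


def entity_precision(source: str, summary: str) -> float:
    pattern = r"\b[A-Z][a-z]+\b"  # crude proxy for named entities
    return _precision(re.findall(pattern, summary), set(re.findall(pattern, source)))


TOY_METRICS: Dict[str, Metric] = {
    "token": token_precision,
    "number": number_precision,
    "entity": entity_precision,
}


def build_preference(
    source: str,
    summary_a: str,
    summary_b: str,
    metrics: Dict[str, Metric],
    min_agreement: float = 0.6,  # assumed threshold, not taken from the paper
) -> Optional[PreferencePair]:
    """Let each metric vote for the summary it scores higher (ties abstain).

    Returns a preference pair only if a large enough share of the metrics
    back the same summary; otherwise the pair is filtered out (None).
    """
    if not metrics:
        return None
    votes_a = votes_b = 0
    for metric in metrics.values():
        diff = metric(source, summary_a) - metric(source, summary_b)
        if diff > 0:
            votes_a += 1
        elif diff < 0:
            votes_b += 1
    total = len(metrics)
    if votes_a / total >= min_agreement:
        return PreferencePair(source, summary_a, summary_b)
    if votes_b / total >= min_agreement:
        return PreferencePair(source, summary_b, summary_a)
    return None


if __name__ == "__main__":
    src = "Acme reported revenue of 5 million dollars in 2020."
    # Two lexically similar candidates that differ in a single number.
    pair = build_preference(
        src,
        "Acme reported revenue of 5 million dollars.",
        "Acme reported revenue of 9 million dollars.",
        TOY_METRICS,
    )
    print(pair)
```

Reducing each metric to a vote avoids having to calibrate or rescale raw scores that live on different ranges, which is one way to read the abstract's point about avoiding complex reward shaping. Pairs that survive the filter could then feed a preference-optimisation objective of the reader's choice.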
Similar Articles
Learning to summarize with human feedback
OpenAI demonstrates a technique for improving language model summarization by training a reward model on human preferences and fine-tuning models with reinforcement learning, achieving significant quality improvements that generalize across datasets. This work advances model alignment through human feedback at scale, with applications beyond summarization.
Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
This paper introduces PolyFact, a large-scale multilingual factual QA dataset, and demonstrates that reinforcement learning via GRPO significantly improves cross-lingual factual consistency in LLMs compared to supervised fine-tuning, by reorganizing multilingual representations.
Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets
This paper proposes an evidence-based model to automatically generate query keywords from query-free summarization datasets, enabling the creation of query-focused summarization datasets. Experimental results show that summaries generated using evidence-based queries achieve competitive ROUGE scores compared to original queries.
A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
Researchers from Jilin University systematically evaluate positional bias in multi-video summarization using MLLMs, constructing a benchmark from ActivityNet and News videos and assessing nine models with metrics including Coverage, Directional Positional Bias, and Middle-Edge Gap. Results show positional effects are domain- and model-dependent, and increasing visual or generation budget does not uniformly resolve the imbalance.
Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
This paper presents SSAS (Syntactic & Semantic Context Assessment Summarization), a framework designed to improve consistency in LLM-based sentiment prediction by reducing noise and variance through hierarchical classification and iterative summarization. Empirical evaluation on three industry-standard datasets shows up to 30% improvement in data quality and reliability for enterprise decision-making.