Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

arXiv cs.CL Papers

Summary

This paper proposes an evidence-based model to automatically generate query keywords from query-free summarization datasets, enabling the creation of query-focused summarization datasets. Experimental results show that summaries generated using evidence-based queries achieve competitive ROUGE scores compared to original queries.

arXiv:2605.05392v1 Announce Type: new Abstract: Large-scale datasets are widely used to perform summarization tasks, but they may not include queries alongside documents and summaries. In the search for suitable datasets for Query-Focused Summarization (QFS), we identify two research questions: Is it possible to automatically generate evidence-based query keywords from query-free datasets? Does evidence-based query generation support the QFS task? This paper proposes an evidence-based model to generate queries from query-free datasets. To evaluate our model intrinsically, we compare the similarity between the original queries and the system-generated queries of two QFS datasets. We also perform summarization tasks using different pre-trained models, as well as a state-of-the-art (SOTA) QFS model, to measure the extrinsic performance of our query generation approach. Experimental results indicate that summaries generated using evidence-based queries achieve competitive ROUGE scores compared to those generated from the original queries.

# Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets
Source: [https://arxiv.org/html/2605.05392](https://arxiv.org/html/2605.05392)
Deen Abdullah University of Lethbridge Alberta, Canada deen\.abdullah@uleth\.ca

###### Abstract

Large\-scale datasets are widely used to perform summarization tasks, but they may not include queries alongside documents and summaries\. In the search for suitable datasets for Query\-Focused Summarization \(QFS\), we identify two research questions: Is it possible to automatically generate evidence\-based query keywords from query\-free datasets? Does evidence\-based query generation support the QFS task? This paper proposes an evidence\-based model to generate queries from query\-free datasets\. To evaluate our model intrinsically, we compare the similarity between the original queries and the system\-generated queries of two QFS datasets\. We also perform summarization tasks using different pre\-trained models, as well as a state\-of\-the\-art \(SOTA\) QFS model, to measure the extrinsic performance of our query generation approach\. Experimental results indicate that summaries generated using evidence\-based queries achieve competitive ROUGE scores compared to those generated from the original queries\.

Yllias Chali, University of Lethbridge, Alberta, Canada, yllias.chali@uleth.ca

Deen Abdullah, University of Lethbridge, Alberta, Canada, deen.abdullah@uleth.ca

## 1 Introduction

Query-focused summarization (QFS) generates a summary of the original documents that is tailored to a specific given query. Researchers have implemented various neural models and proposed distinctive approaches that advance both extractive and abstractive query-focused summarization tasks (Lin, [2004](https://arxiv.org/html/2605.05392#bib.bib1); Gupta et al., [2007](https://arxiv.org/html/2605.05392#bib.bib2); Wan et al., [2007](https://arxiv.org/html/2605.05392#bib.bib3); Ouyang et al., [2011](https://arxiv.org/html/2605.05392#bib.bib4); Feigenblat et al., [2017](https://arxiv.org/html/2605.05392#bib.bib5); Nema et al., [2017](https://arxiv.org/html/2605.05392#bib.bib6); Hasselqvist et al., [2017](https://arxiv.org/html/2605.05392#bib.bib7); Baumel et al., [2018](https://arxiv.org/html/2605.05392#bib.bib8); Abdullah and Chali, [2020](https://arxiv.org/html/2605.05392#bib.bib9); Xu and Lapata, [2020](https://arxiv.org/html/2605.05392#bib.bib10); Laskar et al., [2020](https://arxiv.org/html/2605.05392#bib.bib11); Su et al., [2021](https://arxiv.org/html/2605.05392#bib.bib12)). However, the lack of appropriate datasets for QFS has always been a concern for researchers, as the unavailability of large-scale datasets makes the task more challenging (Fisher and Roark, [2006](https://arxiv.org/html/2605.05392#bib.bib13); See et al., [2017](https://arxiv.org/html/2605.05392#bib.bib14); Liu and Lapata, [2019](https://arxiv.org/html/2605.05392#bib.bib15); Abdullah and Chali, [2020](https://arxiv.org/html/2605.05392#bib.bib9)). Thus, the shortage of large-scale QFS datasets motivates the development of effective query generation approaches. To address this issue, we propose a context-oriented, evidence-based model that supports query-focused summarization by generating queries from documents in any query-free dataset. Using a transfer learning approach, we train our evidence-based model on the (article, highlight) pairs from the CNN/DailyMail dataset.
To avoid data bias, we use different datasets, such as Debatepedia and TD\-QFS, for the QFS task instead of CNN/DailyMail\. Additionally, both datasets contain queries, allowing us to compare the performance of original queries with that of the generated evidence\-based queries\. Samples of the original and evidence\-based queries from the TD\-QFS dataset are shown in Table 1\.

Table 1: Sample queries (original and evidence-based) from the TD-QFS dataset.

## 2 Related Work

Pre-trained models, including BERT (Devlin et al., [2019](https://arxiv.org/html/2605.05392#bib.bib16)), GPT (Radford et al., [2019](https://arxiv.org/html/2605.05392#bib.bib17)), RoBERTa (Liu et al., [2019](https://arxiv.org/html/2605.05392#bib.bib18)), T5 (Raffel et al., [2020](https://arxiv.org/html/2605.05392#bib.bib19)), LED (Beltagy et al., [2020](https://arxiv.org/html/2605.05392#bib.bib20)), BART (Lewis et al., [2020](https://arxiv.org/html/2605.05392#bib.bib21)), and PEGASUS (Zhang et al., [2020](https://arxiv.org/html/2605.05392#bib.bib22)), have been widely used on various datasets such as Gigaword (Ma and Huang, [2006](https://arxiv.org/html/2605.05392#bib.bib23)), CNN/DailyMail (Hermann et al., [2015](https://arxiv.org/html/2605.05392#bib.bib24)), SQuAD (Rajpurkar et al., [2016](https://arxiv.org/html/2605.05392#bib.bib25)), TD-QFS (Baumel et al., [2016](https://arxiv.org/html/2605.05392#bib.bib26)), and Debatepedia (Nema et al., [2017](https://arxiv.org/html/2605.05392#bib.bib6)) to perform summarization, machine translation, and other NLP tasks (Rush et al., [2015](https://arxiv.org/html/2605.05392#bib.bib27); Nallapati et al., [2016](https://arxiv.org/html/2605.05392#bib.bib28); Durrett et al., [2016](https://arxiv.org/html/2605.05392#bib.bib29)).

By emphasizing a query-based attention mechanism, Nema et al. ([2017](https://arxiv.org/html/2605.05392#bib.bib6)) implemented a diversity-driven model that reduces repetitive phrases in summaries. Abdullah and Chali ([2020](https://arxiv.org/html/2605.05392#bib.bib9)) proposed a query generation approach that considers both the input document and the target summary. Similarly, Xu and Lapata ([2020](https://arxiv.org/html/2605.05392#bib.bib10)) addressed the problem of query–cluster interaction and proposed a coarse-to-fine model for query-focused multi-document extractive summarization.

In this paper, we propose an evidence-based model that leverages a transfer learning approach to generate evidence-based queries. First, we train a model to generate evidence keywords from (article, highlight) pairs in the CNN/DailyMail dataset. We then use this evidence model to generate evidence-based queries for the Debatepedia and TD-QFS datasets.

## 3 Problem Definition

Given a query $Q_i$ and a document $D_i$, we generate a query-relevant summary $S_i$ in the QFS task. A query should focus on the parts of the document that are related to its keywords, while the summary should cover the corresponding query-relevant contexts within the document. We define the common context words present in both the document and the summary as evidence in this work. However, the challenge lies in generating such evidence using only the document. Therefore, we hypothesize that a transfer learning approach can help train an evidence-based model for the document-to-evidence-based query generation task.

## 4 Our Framework

In query\-focused summarization, the summary should align with the query, meaning that query\-related information must be present in the document\. Therefore, the query should be supported by both the summary and the document, implying that evidence keywords must be reflected in the query\. However, only a few QFS datasets provide query–document–summary triads, and these are often constructed based on simplifying assumptions\. For example, in the Debatepedia dataset, questions from controversial debates are treated as queries, while topic titles are treated as summaries\. In such cases, some titles \(summaries\) may not be fully relevant to the queries\.

Motivated by this limitation, we investigate whether evidence keywords can perform better than the original queries in QFS datasets. If our hypothesis holds, evidence-based queries could be applied to query-free resources for QFS tasks. Therefore, we propose an evidence-based model that generates evidence from documents to be used as queries. Our work (code available at https://anonymous.4open.science/r/Our-Project) consists of two main steps. First, we fine-tune a pre-trained model on the CNN/DailyMail dataset for the document-to-query generation task. Then, using the evidence-based model, we generate evidence-based queries for the Debatepedia and TD-QFS datasets without accessing the summaries. This transfer learning approach helps us avoid target leakage while generating evidence-based queries.

We use the CNN/DailyMail dataset, which contains news article–highlight pairs, to generate evidence\-based queries\. Specifically, we extract the common words from news articles and their corresponding highlights and consider them as evidence using Equation 1:

$$E_i \leftarrow \{w_{ij}\} \; (\text{if } w_{ij} = w_{ik}) \tag{1}$$

where $E_i$ is the set of extracted evidence keywords from the news article ($N_i$) and the highlight ($H_i$) of the $i^{th}$ sample. Here, $w_{ij}$ are the tokenized words from $N_i$, and $w_{ik}$ are the tokenized words from $H_i$.
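
As an illustration of Equation 1, the following minimal Python sketch collects the words that an article and its highlight have in common. The regex tokenizer, the lowercasing, the deduplication, and the function names are assumptions made for this example; the paper does not specify its tokenization or whether stop words are filtered.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def extract_evidence(article: str, highlight: str) -> list[str]:
    """Return article tokens that also occur in the highlight, in article order, deduplicated."""
    highlight_vocab = set(tokenize(highlight))
    seen = set()
    evidence = []
    for word in tokenize(article):
        if word in highlight_vocab and word not in seen:
            seen.add(word)
            evidence.append(word)
    return evidence

article = "The city council approved a new budget for public transit on Monday."
highlight = "Council approves transit budget."
evidence = extract_evidence(article, highlight)
print(evidence)
```

Because the match is exact, "approved" and "approves" do not count as common words here; stemming or lemmatization would change that, and the paper does not say which behavior it uses.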

The T5 model has been successfully applied to various downstream tasks; therefore, we select it as our pre\-trained model and fine\-tune it for the evidence generation task\. The news articles are fed into the encoder, and the extracted evidence is provided to the decoder for supervised fine\-tuning\. We use the cross\-entropy loss function to update the model parameters during backpropagation\. The overall architecture of our evidence\-based model is illustrated in Figure 1\.
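
For reference, the token-level cross-entropy objective commonly used for sequence-to-sequence fine-tuning can be written as below. The symbols $e_{i,t}$ (the $t$-th token of the target evidence sequence for sample $i$), $T_i$ (its length), and $\theta$ (the model parameters) are notation introduced here for illustration, not taken from the paper.

$$\mathcal{L}_i(\theta) = -\sum_{t=1}^{T_i} \log p_\theta\left(e_{i,t} \mid e_{i,<t}, N_i\right)$$

Under this reading, the encoder conditions on the news article $N_i$ and the decoder is trained to predict each evidence token given the preceding ones.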

![Refer to caption](https://arxiv.org/html/2605.05392v1/fig1.png)

Figure 1: Evidence model: fine-tuning T5 on CNN/DM (news articles, highlights).

## 5 Evaluation Detail

### 5.1 Intrinsic Evaluation

To determine the similarity between our evidence-based queries and the original queries, we performed an intrinsic evaluation using the open-source library spaCy (https://spacy.io/).

### 5.2 Extrinsic Evaluation

#### 5.2.1 Summarization using Pre-trained Models

After generating the evidence\-based queries, we ranked the sentences in each document according to their relevance to the generated queries, thereby transforming the documents into query\-relevant inputs\. This ranking ensures that query\-related sentences appear at the beginning of the document, reducing the risk that important information is truncated due to input size limitations\. Finally, we fine\-tuned several pre\-trained summarization models \(four models were used separately in our experiments\) on the query\-relevant documents from the Debatepedia dataset to generate query\-focused summaries\.

Sentence Ranking

We used the Debatepedia dataset for our QFS task, where sentence ranking helps prepare documents as query\-relevant text inputs\. First, we applied the evidence\-based model to generate evidence\-based queries for the documents in the Debatepedia dataset\. Then, for each sample, we split the document into a list of sentences and converted all texts—including the generated evidence\-based query and the document sentences—into their corresponding vector representations using Equations 2, 3, and 4, respectively\. Next, we computed the similarity between each sentence and the query using spaCy’s similarity metric, as shown in Equation 5\. Finally, we sorted all sentences in descending order of their similarity scores to construct a query\-relevant document, as described in Equation 6\.

$$S_i = \mathrm{sentenceTokenization}(D_i) \tag{2}$$

$$E_i^{vec} = \mathrm{Doc2Vec}(E_i) \tag{3}$$

$$s_j^{vec} = \mathrm{Doc2Vec}(s_j); \quad [s_j \in S_i] \tag{4}$$

$$s_j^{sim} = \mathrm{spaCy.similarity}(E_i^{vec}, s_j^{vec}) \tag{5}$$

$$D_i^{E} = \{s_1, s_2, \ldots, s_p, s_q, \ldots, s_{|D_i|}\} \tag{6}$$

where $S_i$ is the list of sentences of the document $D_i$ in the $i^{th}$ sample. $E_i^{vec}$ and $s_j^{vec}$ are the vector representations of the evidence $E_i$ and the sentence $s_j$, respectively. $s_j^{sim}$ is the similarity score between the evidence-based query and the $j^{th}$ sentence of the document. Finally, $D_i^{E}$ represents the query-focused document of the $i^{th}$ sample, where $[\forall p, q \; s_p^{sim} \geq s_q^{sim},\; p < q]$.
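
The ranking procedure of Equations 2 to 6 can be sketched in plain Python as follows. This is not the authors' implementation: a bag-of-words cosine similarity stands in for the spaCy vectors and similarity metric so that the example runs without external models, and the regex sentence splitter and all function names are assumptions.

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for the dense vectors of Equations 3 and 4."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_sentences(document: str, query: str) -> list[str]:
    """Split into sentences (Eq. 2), score each against the query (Eq. 5), sort descending (Eq. 6)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_vec = bow_vector(query)
    # sorted() is stable, so sentences with equal scores keep their original order.
    return sorted(sentences, key=lambda s: cosine(query_vec, bow_vector(s)), reverse=True)

document = (
    "The weather was mild. "
    "The council approved the transit budget. "
    "Residents welcomed the new budget."
)
ranked = rank_sentences(document, "council transit budget")
print(ranked)
```

In this toy document the sentence sharing all three query words moves to the front, the sentence sharing one word comes second, and the unrelated sentence goes last.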

Summarization Model

We used transformer\-based pre\-trained models for our summarization task: PEGASUS, BART, RoBERTa, and LED\. Since these models can handle a limited number of input tokens \(1024 tokens for PEGASUS, BART, and LED, and 514 tokens for RoBERTa\), it is important to place the most query\-relevant tokens at the beginning of the input sequence during training to generate query\-focused summaries effectively\. Our sentence\-ranking approach ensures that the most query\-relevant sentences appear at the beginning of the input\.
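
To see why sentence order matters under an input limit, the sketch below truncates the same three sentences in their original and ranked orders. Whitespace tokens approximate model tokens here, the budget of 6 is far smaller than the real limits, and the ranking is supplied by hand; all of these are assumptions for illustration only.

```python
def truncate_tokens(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])

original_order = [
    "The weather was mild on Monday.",
    "Residents gathered downtown.",
    "The council approved the transit budget.",
]
# Hypothetical ranking for the query "council transit budget".
ranked_order = [original_order[2], original_order[0], original_order[1]]

budget = 6  # toy limit; the models discussed here accept hundreds of tokens
unranked_input = truncate_tokens(" ".join(original_order), budget)
ranked_input = truncate_tokens(" ".join(ranked_order), budget)
print(unranked_input)
print(ranked_input)
```

With the original order, the query-relevant sentence is cut off entirely; after ranking, it is the part of the input that survives truncation.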

Pre\-trained models are typically trained on specific downstream tasks and can be fine\-tuned for similar tasks on different datasets\. We selected PEGASUS, BART, RoBERTa, and LED because they are pre\-trained for summarization or sentence generation tasks\. We then fine\-tuned these models on query\-focused documents from the Debatepedia dataset for the QFS task to evaluate our hypotheses\.

#### 5.2.2 Summarization using a SOTA QFS Model

We conducted another experiment using QuerySum (Xu and Lapata, [2020](https://arxiv.org/html/2605.05392#bib.bib10)), a recent state-of-the-art (SOTA) QFS model, to compare the results obtained using the original queries and the evidence-based queries on the TD-QFS dataset.

## 6 Experimental Setup

### 6.1 Datasets

We used the CNN/DailyMail dataset to train the evidence-based model, and the Debatepedia and TD-QFS datasets to evaluate it using both the generated evidence-based queries and the original queries.

### 6.2 Implementation Details

We used 70K training and 1,337 validation samples from the CNN/DailyMail dataset to fine\-tune the evidence\-based model\. For the summarization task, we used 12K training and 719 validation samples from the Debatepedia dataset, along with five pre\-trained models: T5, PEGASUS, BART, RoBERTa, and LED\.

We used a similar parameter configuration to fine-tune both the evidence generation and summarization models. We set the number of epochs to 3, weight decay to 0.01, and learning rate to 5e-05. We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The training batch size was set to 8, and the evaluation batch size to 32. During fine-tuning of the evidence-based model, we set warmup steps to 5,000 and evaluated the model every 500 steps. For the summarization models, we set warmup steps to 1,000 and evaluated the models every 250 steps.
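
As a rough sanity check on these settings, the sketch below estimates the total number of optimizer steps and the share of training spent in warmup. It assumes that "70K" and "12K" mean exactly 70,000 and 12,000 samples, that training runs on a single device, and that there is no gradient accumulation; none of these details are stated in the paper.

```python
import math

def optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one step per batch on a single device."""
    return math.ceil(num_samples / batch_size) * epochs

evidence_steps = optimizer_steps(70_000, 8, 3)       # "70K" read as 70,000
summarization_steps = optimizer_steps(12_000, 8, 3)  # "12K" read as 12,000

# Fraction of all steps covered by the reported warmup (5,000 and 1,000 steps).
evidence_warmup_share = 5_000 / evidence_steps
summarization_warmup_share = 1_000 / summarization_steps

print(evidence_steps, round(evidence_warmup_share, 3))
print(summarization_steps, round(summarization_warmup_share, 3))
```

Under these assumptions, warmup would cover roughly a fifth of each run (about 19% for the evidence model and 22% for the summarization models).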

To implement the QuerySum model, we followed the instructions provided by Xu and Lapata ([2020](https://arxiv.org/html/2605.05392#bib.bib10)) for both experimental settings, using original and evidence-based queries, respectively.

## 7 Results and Discussion

After performing the intrinsic evaluation, we computed the similarity scores between the original queries and the evidence\-based queries for the Debatepedia and TD\-QFS datasets, as shown in Table 2\. The Debatepedia dataset yields a lower similarity score since its queries are formulated as questions\. In contrast, both TD\-QFS queries and our evidence\-based queries are represented as sets of keywords\.

Table 2: Similarity score between the original query and the evidence-based query.

We did not aim to achieve state-of-the-art results on the Debatepedia dataset for the QFS task; rather, our goal was to demonstrate that evidence-based queries can perform better than the original queries available in the dataset across four different pre-trained models. Our experimental results are shown in Table 3.

Table 3: Performance of the PEGASUS, BART, RoBERTa, and LED models on the Debatepedia dataset. R-1, R-2, and R-L stand for the F1 score of ROUGE 1, 2, and L, respectively.

From Table 3, we observe that our evidence-based queries consistently outperform the original dataset queries in generating summaries across all four pre-trained models. BART achieves the highest ROUGE-1, ROUGE-2, and ROUGE-L scores among the four models. LED shows the second-best performance, while RoBERTa obtains the lowest scores, except for the precision values of ROUGE-1 and ROUGE-L.
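
For readers unfamiliar with the metric, the following simplified sketch computes a ROUGE-1 F1 score from clipped unigram overlap. It is an illustration only: it uses whitespace tokenization with no stemming, so it will not reproduce scores from the official ROUGE package, and the function name is hypothetical.

```python
from collections import Counter

def rouge_1_f1(candidate: str, reference: str) -> float:
    """F1 of clipped unigram overlap between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge_1_f1(
    "council approves transit budget",
    "the council approved the transit budget",
)
print(round(score, 3))
```

In this example three of the four candidate words appear in the six-word reference, giving a precision of 0.75, a recall of 0.5, and an F1 of 0.6.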

Using the same setup as the QuerySum model, we obtained ROUGE scores that differ from those reported in the original paper. Therefore, we report the results obtained from our own experimental environment. After replacing the original queries with evidence-based queries, we achieved a higher ROUGE-SU4 score, while ROUGE-1 and ROUGE-2 remained very close to the original scores. These results are presented in Table 4.

Table 4: Performance using the QuerySum (Xu and Lapata, [2020](https://arxiv.org/html/2605.05392#bib.bib10)) model on the TD-QFS dataset. R-1, R-2, and R-SU4 stand for the F1 score of ROUGE 1, 2, and SU4, respectively.

Based on the results in Tables 3 and 4, we conclude that our evidence-based model successfully replaces the original queries in the Debatepedia and TD-QFS datasets for the QFS task, achieving improved performance. Hence, this evidence-based model can support query-free summarization datasets by generating queries from their documents for the QFS task.

## 8 Conclusion

In this work, we present an evidence\-based query generation model and provide comparative evidence that our approach successfully helps summarization models generate better summaries for query\-focused datasets\. In the future, we would like to extend our query generation approach to large\-scale query\-free datasets and further investigate how the generated queries support the QFS task\.

## References

- D. M. Abdullah and Y. Chali (2020). Towards generating query to perform query focused abstractive summarization using pre-trained model. In Proceedings of the 13th International Conference on Natural Language Generation, Dublin, Ireland, pp. 80–85. [Link](https://aclanthology.org/2020.inlg-1.11)
- T. Baumel, R. Cohen, and M. Elhadad (2016). Topic concentration in query focused summarization datasets. Proceedings of the AAAI Conference on Artificial Intelligence 30(1). [Link](https://ojs.aaai.org/index.php/AAAI/article/view/10323), [Document](https://dx.doi.org/10.1609/aaai.v30i1.10323)
- T. Baumel, M. Eyal, and M. Elhadad (2018). Query focused abstractive summarization: incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv:1801.07704.
- I. Beltagy, M. E. Peters, and A. Cohan (2020). Longformer: the long-document transformer. arXiv:2004.05150.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. [Link](https://aclanthology.org/N19-1423), [Document](https://dx.doi.org/10.18653/v1/N19-1423)
- G. Durrett, T. Berg-Kirkpatrick, and D. Klein (2016). Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1998–2008. [Link](https://aclanthology.org/P16-1188), [Document](https://dx.doi.org/10.18653/v1/P16-1188)
- G. Feigenblat, H. Roitman, O. Boni, and D. Konopnicki (2017). Unsupervised query-focused multi-document summarization using the cross entropy method. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, New York, NY, USA, pp. 961–964. ISBN 9781450350228. [Link](https://doi.org/10.1145/3077136.3080690), [Document](https://dx.doi.org/10.1145/3077136.3080690)
- S. Fisher and B. Roark (2006). Query-focused summarization by supervised sentence ranking and skewed word distributions. In Proceedings of the Document Understanding Conference, DUC-2006, New York, USA.
- S. Gupta, A. Nenkova, and D. Jurafsky (2007). Measuring importance and query relevance in topic-focused multi-document summarization. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, pp. 193–196. [Link](https://aclanthology.org/P07-2049)
- J. Hasselqvist, N. Helmertz, and M. Kågebäck (2017). Query-based abstractive summarization using neural networks. arXiv:1712.06100.
- K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015). Teaching machines to read and comprehend. Advances in neural information processing systems 28.
- M. T. R. Laskar, E. Hoque, and J. Huang (2020). Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models. In Advances in Artificial Intelligence, C. Goutte and X. Zhu (Eds.), Cham, pp. 342–348. ISBN 978-3-030-47358-7.
- M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. [Link](https://aclanthology.org/2020.acl-main.703), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.703)
- C. Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. [Link](https://aclanthology.org/W04-1013)
- Y. Liu and M. Lapata (2019). Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3730–3740. [Link](https://aclanthology.org/D19-1387), [Document](https://dx.doi.org/10.18653/v1/D19-1387)
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized bert pretraining approach. arXiv:1907.11692.
- W. Ma and C. Huang (2006). Uniform and effective tagging of a heterogeneous giga-word corpus. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. [Link](http://www.lrec-conf.org/proceedings/lrec2006/pdf/294_pdf.pdf)
- R. Nallapati, B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 280–290. [Link](https://aclanthology.org/K16-1028), [Document](https://dx.doi.org/10.18653/v1/K16-1028)
- P. Nema, M. M. Khapra, A. Laha, and B. Ravindran (2017). Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1063–1072. [Link](https://aclanthology.org/P17-1098), [Document](https://dx.doi.org/10.18653/v1/P17-1098)
- Y. Ouyang, W. Li, S. Li, and Q. Lu (2011). Applying regression models to query-focused multi-document summarization. Information Processing & Management 47(2), pp. 227–237. ISSN 0306-4573. [Document](https://dx.doi.org/10.1016/j.ipm.2010.03.005), [Link](https://www.sciencedirect.com/science/article/pii/S0306457310000257)
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019). Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. [Link](https://aclanthology.org/D16-1264), [Document](https://dx.doi.org/10.18653/v1/D16-1264)
- A. M. Rush, S. Chopra, and J. Weston (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389. [Link](https://aclanthology.org/D15-1044), [Document](https://dx.doi.org/10.18653/v1/D15-1044)
- A. See, P. J. Liu, and C. D. Manning (2017). Get to the point: summarization with pointer-generator networks. arXiv:1704.04368.
- D. Su, T. Yu, and P. Fung (2021). Improve query focused abstractive summarization by incorporating answer relevance. arXiv:2105.12969.
- X. Wan, J. Yang, and J. Xiao (2007). Manifold-ranking based topic-focused multi-document summarization. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI’07, San Francisco, CA, USA, pp. 2903–2908.
- Y. Xu and M. Lapata (2020). Coarse-to-fine query focused multi-document summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3632–3645. [Link](https://aclanthology.org/2020.emnlp-main.296), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.296)
- J. Zhang, Y. Zhao, M. Saleh, and P. Liu (2020). PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 11328–11339. [Link](https://proceedings.mlr.press/v119/zhang20ae.html)

## Appendix A

Table 5 lists the generated and original queries from the TD-QFS dataset. A sample number such as 0 2 denotes document number 2 of cluster number 0.

Table 5: Original and evidence-based queries from the TD-QFS dataset.
