LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

arXiv cs.CL 05/29/26, 04:00 AM Papers
Summary
LLMBridge introduces an LLM-based pipeline for end-to-end referential bridging resolution, achieving state-of-the-art performance on three English datasets. The system combines heuristic pre/post-processing with LLM natural language inference.
arXiv:2605.29048v1 Announce Type: new Abstract: In this paper, we introduce LLMBridge, a new LLM based system for the task of end-to-end referential bridging resolution in English. Our bridging resolution pipeline combines heuristic pre/post-processing with the natural language inference ability that comes from LLMs. We evaluate our bridging resolution pipeline on 3 datasets which have been used for referential bridging resolution evaluation in English: ISNotes, BASHI, and GUMBridge. Comparison to previous bridging resolution systems shows that the performance of LLMBridge surpasses previous state-of-the-art (SoTA) systems for all 3 datasets in the challenging End-to-end Evaluation Setting, as well as the Basic Bridging Resolution Evaluation Setting (gold bridging anaphor given). We also conduct a thorough error analysis of the LLMBridge performance, examining what varieties of bridging remain difficult for LLM based systems to identify. With this paper, we release the code for the LLMBridge pipeline.
Cached at: 05/29/26, 09:15 AM
# An LLM Pipeline for End-to-end Referential Bridging Resolution in English
Source: [https://arxiv.org/html/2605.29048](https://arxiv.org/html/2605.29048)
Lauren Levine and Amir Zeldes, Georgetown University, Department of Linguistics, \{lel76, amir\.zeldes\}@georgetown\.edu

###### Abstract

In this paper, we introduce LLMBridge, a new LLM based system for the task of end\-to\-end referential bridging resolution in English\. Our bridging resolution pipeline combines heuristic pre/post\-processing with the natural language inference ability that comes from LLMs\. We evaluate our bridging resolution pipeline on 3 datasets which have been used for referential bridging resolution evaluation in English: ISNotes, BASHI, and GUMBridge\. Comparison to previous bridging resolution systems shows that the performance of LLMBridge surpasses previous state\-of\-the\-art \(SoTA\) systems for all 3 datasets in the challenging End\-to\-end Evaluation Setting, as well as the Basic Bridging Resolution Evaluation Setting \(gold bridging anaphor given\)\. We also conduct a thorough error analysis of the LLMBridge performance, examining what varieties of bridging remain difficult for LLM based systems to identify\. With this paper, we release the code for the LLMBridge pipeline\.

## 1 Introduction

Bridging is an anaphoric phenomenon where the referent of a newly introduced entity is inferable due to its relationship with a previously introduced entity\. Consider the following sentences:

\(1\) There is <u>a house</u>\. **The door** is red\.

Footnote 1: Bridging anaphora are marked in bold face, and their associative antecedents are underlined\.

In example \(1\) above, we understand the newly introduced entity “the door” to specifically be the door of the aforementioned house, due to both the sequencing of the sentences and the semantic part\-whole relationship that exists between the entities of “house” and “door”\. In this bridging pair, “the door” is referred to as the bridging anaphor, and “a house” is its associative antecedent\. Beyond such part\-whole relations, the associative relationships that give rise to bridging manifest in a variety of different ways, including relative adjectives \(a dog → a larger/different/other dog\), and prototypical associations \(a library → the books\)\. Interpreting such implicit entity relations is necessary for various downstream NLP tasks, such as question\-answering, controllable natural language generation and model output factuality verification\. At present, it is unclear to what extent LLM systems track such implicit entity relations\.

Bridging resolution is the task of automatically detecting bridging anaphora in natural language and resolving them back to their respective associative antecedents\. While the task of bridging resolution has not received as much attention as other anaphoric phenomena like coreference resolution, it has received increased attention in recent years by being included in shared task datasets Khosla et al\. \([2021](https://arxiv.org/html/2605.29048#bib.bib5)\); Yu et al\. \([2022](https://arxiv.org/html/2605.29048#bib.bib6)\), as well as through independent efforts towards developing bridging datasets and bridging resolution systems Kobayashi and Ng \([2020](https://arxiv.org/html/2605.29048#bib.bib11)\)\. Previous bridging resolution systems have included rule\-based Hou et al\. \([2014](https://arxiv.org/html/2605.29048#bib.bib14)\); Roesiger et al\. \([2018](https://arxiv.org/html/2605.29048#bib.bib15)\), neural Yu and Poesio \([2020](https://arxiv.org/html/2605.29048#bib.bib16)\); Kobayashi et al\. \([2022b](https://arxiv.org/html/2605.29048#bib.bib17)\), and hybrid approaches Kobayashi et al\. \([2022a](https://arxiv.org/html/2605.29048#bib.bib12)\)\. However, despite these attempts, bridging resolution has remained an extremely challenging NLP task\. In the End\-to\-end Evaluation Setting, SoTA systems do not exceed an F1 score of 40% for anaphor recognition and do not exceed an F1 score of 30% for anaphor resolution Kobayashi et al\. \([2022b](https://arxiv.org/html/2605.29048#bib.bib17), [2023](https://arxiv.org/html/2605.29048#bib.bib13)\)\.

Previous work has shown the identification of bridging instances to be a very difficult task, even for human annotators, due to its highly subjective nature Levine and Zeldes \([2025](https://arxiv.org/html/2605.29048#bib.bib7)\)\. Thus far, there has been little exploration of LLMs’ ability to reliably recognize bridging as a phenomenon\. While recent work has included efforts to provide a benchmark/baseline in limited evaluation settings Bu et al\. \([2025](https://arxiv.org/html/2605.29048#bib.bib9)\); Levine and Zeldes \([2026](https://arxiv.org/html/2605.29048#bib.bib18)\), there has been no previous attempt to leverage LLMs to perform the task of bridging resolution in the End\-to\-end Setting\. However, previous work framing bridging anaphora resolution as a QA task Hou \([2020](https://arxiv.org/html/2605.29048#bib.bib10)\), as well as the recent bridging resolution baseline in Levine and Zeldes \([2026](https://arxiv.org/html/2605.29048#bib.bib18)\), suggests that LLM\-based query systems have the potential to improve upon previous bridging resolution systems\.

In this paper, we present LLMBridge, the first LLM based end\-to\-end pipeline for bridging resolution\. We evaluate our bridging resolution pipeline on 3 English datasets for referential bridging: ISNotes Markert et al\. \([2012](https://arxiv.org/html/2605.29048#bib.bib2)\), BASHI Rösiger \([2018](https://arxiv.org/html/2605.29048#bib.bib1)\), and GUMBridge Levine and Zeldes \([2026](https://arxiv.org/html/2605.29048#bib.bib18)\), and we find that our pipeline surpasses previous SoTA systems in both the End\-to\-end and Basic bridging resolution evaluation settings\. We provide code for reproducing the LLMBridge pipeline and evaluation, as well as code for preprocessing the evaluation datasets\. \(Footnote 2: Anonymized\.\) We additionally provide detailed error analysis on the performance of LLMBridge, investigating the varieties of bridging that are easy/difficult for LLM based systems to identify\.

## 2 Background on Referential Bridging

Roesiger et al\. \([2018](https://arxiv.org/html/2605.29048#bib.bib15)\) introduce the distinction between referential and lexical bridging as a means of describing the differences in the bridging definitions used by English bridging corpora\. ISNotes, BASHI, and GUMBridge all take an information status based definition of bridging and are, as a consequence, exclusively composed of instances of referential bridging\.

Referential bridging refers to truly anaphoric instances of bridging where the bridging anaphor requires an antecedent to be interpretable, as in the following example:

\(2\) She likes <u>the house</u> because **the windows** are large\.

On the other hand, lexical bridging refers to lexical semantic relations between pairs of entities, such as part\-whole or set\-member relations, which may or may not be anaphoric, as in the following example, where the antecedent is not strictly necessary for interpretation:

\(3\) I went to <u>the United States</u> last month\. My first stop was **Washington, DC**\.

Note that the meaning of “Washington, DC” in the example is recoverable without needing to refer back to “the United States”, though there is a semantic part\-whole relation between the two\.

In this paper, we focus on the task of bridging resolution for referential bridging\. \(Footnote 3: The ARRAU corpus Poesio and Artstein \([2008](https://arxiv.org/html/2605.29048#bib.bib3)\); Uryupina et al\. \([2019](https://arxiv.org/html/2605.29048#bib.bib4)\) annotates related mentions that establish entity coherence through non\-identity relations as bridging, rather than using an information status based definition\. As such, ARRAU contains a mix of referential and lexical bridging, and we do not include ARRAU in our evaluation data\.\) Table [1](https://arxiv.org/html/2605.29048#S2.T1) shows corpus statistics for the following referential bridging corpora: ISNotes, BASHI, and GUMBridge\.

- \*Count includes instances of mediated/bridging and mediated/comparative\.

Table 1: Comparison of English referential bridging corpora used for evaluating bridging resolution systems\.

## 3 Bridging Resolution

### 3\.1 Task Definition

Bridging resolution is the task of automatically recognizing the bridging anaphora in a text and resolving them back to their respective associative antecedents\. The task of bridging resolution can be broken up and described in the following 3 subtasks:

#### Anaphor Recognition

Given the text of a discourse, identify the bridging anaphora present\.

#### Anaphor Resolution

Given a bridging anaphor in a discourse, identify the associative antecedent which makes the referent of the anaphor inferable\. When this subtask is done in isolation, it can also be referred to as antecedent selection\.

#### Subtype Classification

Given a bridging anaphor and antecedent pair in a discourse, select the sub\-variety of bridging \(e\.g\., part\-whole\) from a set of predefined semantic relations \(a specific subtype schema must be assumed\)\.

### 3\.2 Evaluation Settings

The evaluation of bridging resolution is commonly carried out in the following 3 settings, each of which allows for a different amount of gold mention information to be present in the input data\. In Basic Bridging Resolution, the system is given gold mention information and gold bridging anaphora, and the task is to resolve each bridging anaphor back to its respective associative antecedent \(also called anaphor resolution/antecedent selection\)\. \(Footnote 4: This setting has previously been referred to just as “Bridging Resolution”\. We add the designation “Basic” to differentiate it from the general task name\.\) In Full Bridging Resolution, the system is given gold mention information, and the task is to both identify bridging anaphora and resolve them back to their respective associative antecedents in a discourse\. In End\-to\-end Bridging Resolution, the task is the same as in the Full Setting, but the system is only given raw text as input\.

Previous work on bridging resolution has primarily focused on the easier evaluation settings of Basic and Full bridging resolution, and less attention has been given to the more challenging End\-to\-end Setting, despite it being the most realistic evaluation setting\. As such, in this paper we focus on providing scores in the more realistic and challenging End\-to\-end Setting for the English referential bridging resources: ISNotes, BASHI, and GUMBridge\. In this evaluation setting, we report P/R/F1 for Anaphor Recognition and Anaphor Resolution \(the joint recognition of anaphor\-antecedent pair\)\. We also report scores in an altered version of the Basic Setting, providing the gold anaphor as input, but not providing additional gold mention information\. In this evaluation setting, we report Accuracy \(antecedents correctly identified / total count of gold bridging anaphora\)\. In both evaluation settings, we additionally report subtype classification scores for LLMBridge on the GUMBridge corpus\. As GUMBridge allows for multiple subtype annotations, we report both Accuracy for exact match per bridging instance and P/R/F1 on predicting individual subtype annotations\.
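The metrics described above can be illustrated with a minimal sketch. It assumes exact span matching and represents spans as hypothetical `(start, end)` token offsets; the function names `prf1` and `basic_accuracy` are illustrative and are not taken from the released LLMBridge code, whose matching criteria may differ.

```python
from typing import Dict, Hashable, Iterable, Tuple


def prf1(gold: Iterable[Hashable], pred: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over two collections treated as sets."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def basic_accuracy(gold_pairs: Dict, pred_pairs: Dict) -> float:
    """Antecedents correctly identified / total count of gold bridging anaphora."""
    if not gold_pairs:
        return 0.0
    correct = sum(1 for ana, ante in gold_pairs.items() if pred_pairs.get(ana) == ante)
    return correct / len(gold_pairs)


# Hypothetical data: anaphor span -> antecedent span (assumption: exact match).
gold = {(10, 12): (3, 5), (20, 21): (14, 16)}
pred = {(10, 12): (3, 5), (20, 21): (0, 1), (30, 31): (25, 26)}

# Anaphor recognition scores only the anaphor spans.
recognition = prf1(gold.keys(), pred.keys())
# Anaphor resolution scores the joint (anaphor, antecedent) pair.
resolution = prf1(gold.items(), pred.items())
# Basic Setting accuracy over gold anaphora.
accuracy = basic_accuracy(gold, pred)
```

In this toy case both gold anaphora are recognized, but only one is resolved to the right antecedent, so the resolution F1 is lower than the recognition F1, mirroring the gap between the two columns reported in the End\-to\-end Setting.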

We evaluate on the designated test split of GUMBridge, and the full datasets for ISNotes and BASHI \(minus 5 documents from each corpus set aside for prompt development; see Appendix [B](https://arxiv.org/html/2605.29048#A2)\), as they have no test splits\. While we do not provide gold mention information in either of our evaluation settings, we use predicted mention and coreference information obtained from the available Stanza coref models trained on the GUM corpus Zeldes \([2017](https://arxiv.org/html/2605.29048#bib.bib42)\) \(gum\-nospeakers\_roberta\-large\-lora; used for GUMBridge\) and the OntoNotes corpus Weischedel et al\. \([2011](https://arxiv.org/html/2605.29048#bib.bib43)\) \(ontonotes\-singletons\_roberta\-large\-lora; used for ISNotes and BASHI\), which predict both coreferent and singleton mentions\.

## 4 Previous SoTA Bridging Resolution Systems

Since the rise in prominence of neural approaches following the introduction of Transformer models Devlin et al\. \([2019](https://arxiv.org/html/2605.29048#bib.bib37)\); Vaswani et al\. \([2017](https://arxiv.org/html/2605.29048#bib.bib38)\), there have been a number of attempts to address the task of bridging resolution with neural systems\.

Kobayashi et al\. \([2022b](https://arxiv.org/html/2605.29048#bib.bib17)\) give an overview of the performance of recent bridging resolution systems evaluated on ISNotes and BASHI in the more challenging End\-to\-end Setting, in addition to the more commonly reported Full Setting\. Models adapted and evaluated in these experiments include Roesiger et al\. \([2018](https://arxiv.org/html/2605.29048#bib.bib15)\), Yu and Poesio \([2020](https://arxiv.org/html/2605.29048#bib.bib16)\), and Kobayashi and Ng \([2021](https://arxiv.org/html/2605.29048#bib.bib40)\)\. Most recently, Kobayashi et al\. \([2023](https://arxiv.org/html/2605.29048#bib.bib13)\) present PairSpanBERT, a pretrained model for bridging resolution based on SpanBERT Joshi et al\. \([2020](https://arxiv.org/html/2605.29048#bib.bib41)\)\. Extending Kobayashi et al\. \([2022b](https://arxiv.org/html/2605.29048#bib.bib17)\) and other previous systems by leveraging PairSpanBERT, they evaluate on BASHI and ISNotes for both the End\-to\-end Setting and Full Setting\. The system achieves near SoTA performance for the Full Setting, with SoTA performance for the End\-to\-end Setting\.

Taking a different approach, Hou \([2020](https://arxiv.org/html/2605.29048#bib.bib10)\) presents a system that re\-frames the task of bridging resolution as a question answering task based on context\. [Hou](https://arxiv.org/html/2605.29048#bib.bib10)’s BARQA \(QA for Bridging Anaphora Resolution\) system is designed to take in a segment of text and a question about an existing bridging anaphor in that text segment as inputs, and return the corresponding associative antecedent\. Using this methodology, [Hou](https://arxiv.org/html/2605.29048#bib.bib10) reports SoTA performance on the ISNotes and BASHI corpora for the Basic Bridging Resolution Setting\.

To serve as a comparison to the performance of LLMBridge, reported scores for the SoTA systems discussed above are shown in Table [2](https://arxiv.org/html/2605.29048#S6.T2) for the End\-to\-end Setting, and Table [3](https://arxiv.org/html/2605.29048#S6.T3) for the Basic Setting\.

## 5 LLMBridge Pipeline

LLMBridge is an LLM based pipeline for the task of bridging resolution\. It combines heuristic pre/post processing with backend LLM queries to create a robust bridging resolution system for referential bridging\. The pipeline handles 3 subtasks of bridging resolution: \(1\) anaphor recognition, \(2\) anaphor resolution, and \(3\) subtype classification\. Our system provides code to run the subtasks individually \(as in the Basic Bridging Resolution Setting\), where gold input is given for the antecedent selection and subtype classification tasks, or as a full end\-to\-end pipeline, where the output of one task is used as the input for the next \(as in the End\-to\-end Setting\)\.

### 5\.1 Subtask Operationalization

The following describes how we operationalize the subtasks of bridging resolution to be accomplished via LLM query\. In our prompt design, we follow an information status based, referential definition of bridging, where a bridging anaphor must be a newly introduced entity, and it must be inferable due to an anaphoric, associative \(non\-identity\) relation to a previous entity in the discourse\. Complete prompt templates are included in Appendix [B](https://arxiv.org/html/2605.29048#A2)\. The buffer/context window sizes described in the subtask operationalizations below are all a configurable part of the preprocessing\.

#### Anaphor Recognition

For each sentence in the input text, the backend LLM is queried and asked to identify any bridging anaphora in the sentence\. The query provides a definition of bridging anaphora, instructions for the anaphor recognition task, examples, the text of the sentence being asked about with up to 5 tokens of context buffer to the left and right of the sentence, and a back context of up to 150 tokens\. \(Footnote 7: An analysis of the train/dev partitions has shown \> 90% of associative antecedents fall within that window\.\) The LLM is instructed to return a list of bridging anaphora\.
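The windowing described above can be sketched as follows. This is a minimal illustration assuming a pre-tokenized document; the function name `recognition_context`, the returned dictionary keys, and the choice to let the back context overlap with the left buffer are all assumptions, not details confirmed by the paper.

```python
from typing import Dict, List


def recognition_context(
    sentences: List[List[str]], idx: int, buffer: int = 5, back: int = 150
) -> Dict[str, List[str]]:
    """Collect the target sentence, its small left/right buffers, and the back context."""
    flat = [tok for sent in sentences for tok in sent]
    start = sum(len(sent) for sent in sentences[:idx])  # offset of the target sentence
    end = start + len(sentences[idx])
    return {
        # Up to `back` tokens preceding the sentence (150 by default).
        "back_context": flat[max(0, start - back):start],
        # Up to `buffer` tokens on either side of the sentence (5 by default).
        "left_buffer": flat[max(0, start - buffer):start],
        "sentence": flat[start:end],
        "right_buffer": flat[end:end + buffer],
    }


doc = [
    ["There", "is", "a", "house", "."],
    ["The", "door", "is", "red", "."],
    ["It", "creaks", "."],
]
ctx = recognition_context(doc, 1, buffer=2)
```

Both window sizes are keyword arguments here because the text states that the buffer and context sizes are configurable.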

#### Anaphor Resolution

For each bridging anaphor predicted by the system, the backend LLM is queried and asked to identify the associative antecedent\. The query provides a definition of bridging anaphora, instructions for the antecedent selection task, examples, and the text of the sentence containing the predicted anaphor \(marked in double curly brackets: \{\{ \}\}\) with up to 150 tokens of context to the left of the sentence\. The LLM is expected to return the identified antecedent entity, or if no antecedent is identified, it is instructed to return the string “no antecedent”\. In the Basic Bridging Resolution Setting, the anaphor resolution subtask is run with the sentence containing the gold anaphor and 150 tokens of back context as input\.
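A minimal sketch of the two mechanical steps in this subtask, marking the anaphor in the input and interpreting the reply, might look like the following. The helper names `mark_anaphor` and `parse_antecedent` are hypothetical, and the tolerance for quotes and letter case in the reply is an assumption rather than documented behavior.

```python
from typing import List, Optional


def mark_anaphor(tokens: List[str], start: int, end: int) -> List[str]:
    """Wrap tokens[start:end] in double curly brackets."""
    return tokens[:start] + ["{{"] + tokens[start:end] + ["}}"] + tokens[end:]


def parse_antecedent(response: str) -> Optional[str]:
    """Return the antecedent string, or None for a 'no antecedent' reply."""
    cleaned = response.strip().strip('"“”').strip()
    if not cleaned or cleaned.lower() == "no antecedent":
        return None
    return cleaned


sentence = ["The", "door", "is", "red", "."]
marked = " ".join(mark_anaphor(sentence, 0, 2))
```

Candidates for which `parse_antecedent` returns `None` correspond to the "no antecedent" cases that the subtype classification step filters out.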

#### Subtype Classification \(GUMBridge only\)

For each bridging anaphor\-antecedent pair predicted by the system \(candidate anaphora where “no antecedent” is predicted are filtered out\), the backend LLM is asked to classify the bridging subtype of the pair\. The query provides a definition of bridging anaphora, instructions for the subtype classification task \(including details on the 11 bridging subtypes in the GUMBridge annotation schema; Footnote 8: Details of the subtype varieties in GUMBridge are included in Appendix [A](https://arxiv.org/html/2605.29048#A1)\), examples, and the sentence containing the anaphor with 150 tokens of back context \(with the anaphor marked in double curly brackets: \{\{ \}\} and the antecedent marked in asterisks: \* \*\)\. In the Basic Bridging Resolution Setting, the LLM receives the individual sentences containing the anaphor and antecedent with 10 tokens of buffer context to the left and right\. The LLM is instructed to return all of the bridging subtypes applicable to the pair\.

### 5\.2 LLM Model Selection

By design, any LLM can serve as the backend to be queried in the LLMBridge pipeline\. We provide scores for one closed SoTA model, Gemini\-3\.1\-pro\-preview, and two smaller open models, meta\-llama/llama\-3\.3\-70b\-instruct and Qwen2\.5\-7B\-Instruct, in order to demonstrate the competence of different sized language models on bridging resolution\. We also provide scores for Qwen2\.5\-7B\-Instruct\-FT, a version of Qwen2\.5\-7B\-Instruct which we fine\-tune using low\-rank adapters \(LoRA; Hu et al\., [2022](https://arxiv.org/html/2605.29048#bib.bib45)\) on the GUMBridge train data \(213k tokens, 4k bridging instances\), which is the largest English corpus for referential bridging and includes annotations for bridging subtypes\. We fine\-tune the model for 3 epochs using a learning rate of 3e\-4, with low\-rank adapters of rank 8 and a LoRA alpha of 16\. We release the Qwen2\.5\-7B\-Instruct\-FT bridging fine\-tuned model with the LLMBridge pipeline code\. We include this model in our experiments to investigate the gains that fine\-tuning can bring for smaller LLMs, even with limited data and compute\.
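To make the reported fine\-tuning hyperparameters concrete, the sketch below records them in a small dataclass and derives two standard LoRA quantities: the scaling factor alpha / rank applied to the low\-rank update, and the number of adapter parameters for one weight matrix. The class name `LoraSettings` and the 4096 x 4096 matrix shape are illustrative assumptions; the paper does not state which modules were adapted or their dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters as reported in the text (rank 8, alpha 16, lr 3e-4, 3 epochs)."""
    rank: int = 8
    alpha: int = 16
    learning_rate: float = 3e-4
    epochs: int = 3

    @property
    def scaling(self) -> float:
        # LoRA scales the low-rank update B @ A by alpha / rank.
        return self.alpha / self.rank

    def adapter_params(self, d_out: int, d_in: int) -> int:
        # A has shape (rank, d_in) and B has shape (d_out, rank).
        return self.rank * (d_in + d_out)


settings = LoraSettings()
# Hypothetical square projection matrix, used only for illustration.
d = 4096
adapter = settings.adapter_params(d, d)
full = d * d
fraction = adapter / full
```

Under these assumed dimensions the adapter trains well under 1% of the parameters of the matrix it modifies, which is consistent with the paper's framing of fine\-tuning under limited compute.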

### 5\.3 Pre/Post Processing

To boost the performance of our pipeline, we include a number of heuristics which are based on additional linguistic annotations, including coreference/mention, part\-of\-speech, and dependency syntax \(UD; de Marneffe et al\., [2021](https://arxiv.org/html/2605.29048#bib.bib47)\)\. If the additional linguistic annotations are not available, predicted annotations can be used \(we provide code to enable this within our pipeline, leveraging Stanza, Qi et al\. [2020](https://arxiv.org/html/2605.29048#bib.bib46)\), or the pipeline can run without the pre/post processing\. We report scores using predicted annotations from Stanza during the pre/post processing for the End\-to\-end and Basic evaluation settings\.

#### Adjusting Responses to Attested Mention Spans

We adjust the span of the LLM’s predicted response to match an actual, attested \(or Stanza predicted\) mention span for the data in order to improve precision\. We adjust the predicted anaphor returned by the LLM to be the shortest existing mention span which contains the predicted anaphor\. If there is no such existing mention, we check whether there are any existing mention spans which are contained within the predicted anaphor, and, if so, adjust the anaphor span to be the longest such mention\. If there is still no such mention, the predicted anaphor is rejected\. We also do this adjustment of the predicted span for the antecedent selection\.
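The adjustment rule above can be sketched as a small function over token offsets. The name `adjust_to_mention` and the half\-open `(start, end)` span representation are assumptions made for illustration; tie\-breaking between equally long mentions is not specified in the text and here simply follows list order.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) token offsets (assumed representation)


def adjust_to_mention(pred: Span, mentions: List[Span]) -> Optional[Span]:
    """Snap a predicted span to an attested mention span, or reject it."""
    ps, pe = pred
    # Step 1: shortest mention that contains the prediction (an exact match qualifies).
    containing = [m for m in mentions if m[0] <= ps and pe <= m[1]]
    if containing:
        return min(containing, key=lambda m: m[1] - m[0])
    # Step 2: otherwise, longest mention contained within the prediction.
    contained = [m for m in mentions if ps <= m[0] and m[1] <= pe]
    if contained:
        return max(contained, key=lambda m: m[1] - m[0])
    # Step 3: no compatible mention, so the prediction is rejected.
    return None


mentions = [(0, 6), (2, 5), (3, 4), (8, 9)]
snapped_up = adjust_to_mention((2, 4), mentions)
snapped_down = adjust_to_mention((7, 10), mentions)
rejected = adjust_to_mention((5, 7), mentions)
```

The same function would apply unchanged to predicted antecedent spans, since the text states the adjustment is also performed for antecedent selection.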

#### Suggesting Candidate Bridging Anaphora

In order to improve recall, we also use heuristics to identify possible candidates for bridging anaphora and make additional queries to the LLM backend to get specific judgments on them\. The entities we highlight include entities containing comparative adjectives \(e\.g\., “a smaller dog”\), entities containing any keyword from a list of relative or temporal markers \(e\.g\., “another”, “others”, “different”, “following”, “yesterday”, etc\.\), and any two\-token entity where the first token is a definite determiner \(e\.g\., “the door”\)\.
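The three candidate heuristics can be sketched as a single predicate over a tagged mention. This is an illustration under stated assumptions: tokens carry Penn Treebank tags (so `JJR` marks a comparative adjective), the definite determiner is taken to be "the", and the keyword set contains only the examples quoted in the text, since the full list is not given.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, Penn Treebank POS tag); an assumed representation

# Only the markers quoted in the text; the pipeline's full list is not given here.
RELATIVE_OR_TEMPORAL = {"another", "others", "different", "following", "yesterday"}


def is_candidate(mention: List[Token]) -> bool:
    """True if the mention matches any of the three recall-oriented heuristics."""
    forms = [form.lower() for form, _ in mention]
    has_comparative = any(tag == "JJR" for _, tag in mention)
    has_keyword = any(form in RELATIVE_OR_TEMPORAL for form in forms)
    short_definite = len(mention) == 2 and forms[0] == "the"
    return has_comparative or has_keyword or short_definite


smaller_dog = [("a", "DT"), ("smaller", "JJR"), ("dog", "NN")]
the_door = [("the", "DT"), ("door", "NN")]
a_house = [("a", "DT"), ("house", "NN")]
flagged = [is_candidate(m) for m in (smaller_dog, the_door, a_house)]
```

Mentions flagged this way are not accepted outright; per the text, they trigger additional LLM queries that ask for a specific judgment on each candidate.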

#### Filtering Coreference from Predicted Bridging Anaphora

In the anaphoric, information status based definition of referential bridging, a bridging anaphor must be a newly introduced entity in the discourse\. As such, any entity with a previous mention is structurally prohibited from being a bridging anaphor\. Therefore, in order to improve precision on referential bridging, we filter out any predicted bridging anaphora that are subsequent mentions in a coreference chain\.
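A minimal sketch of this filter, assuming coreference chains are given as lists of `(start, end)` token spans and that document order can be recovered by sorting those spans; the function name `filter_subsequent_mentions` is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def filter_subsequent_mentions(predicted: List[Span], chains: List[List[Span]]) -> List[Span]:
    """Drop predicted anaphora that are non-first mentions of a coreference chain."""
    non_first = set()
    for chain in chains:
        ordered = sorted(chain)        # document order by start offset
        non_first.update(ordered[1:])  # everything after the first mention
    return [span for span in predicted if span not in non_first]


chains = [[(0, 2), (10, 11), (20, 22)], [(5, 6)]]  # second chain is a singleton
predicted = [(0, 2), (10, 11), (5, 6), (30, 32)]
kept = filter_subsequent_mentions(predicted, chains)
```

First mentions and singletons survive the filter, since a newly introduced entity is exactly what the referential definition requires of a bridging anaphor.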

## 6 Results

We evaluate LLMBridge on the aforementioned English referential bridging datasets: ISNotes, BASHI, and GUMBridge\. We provide single run scores on the datasets for the End\-to\-end Bridging Resolution Setting in Table [2](https://arxiv.org/html/2605.29048#S6.T2), and the altered Basic Bridging Resolution Setting \(gold bridging anaphor, but no gold mention information\) in Table [3](https://arxiv.org/html/2605.29048#S6.T3)\. When available, we also include previous SoTA bridging resolution scores reported on the datasets for comparison\. In Table [4](https://arxiv.org/html/2605.29048#S6.T4), we provide bridging subtype classification scores on GUMBridge in both evaluation settings\. The highest scores for the F1 and Accuracy metrics are shown in bold\.

Table 2: LLMBridge and previous system results on evaluation datasets in the End\-to\-end Setting\.

Table 3: LLMBridge and previous system results on evaluation datasets for antecedent selection accuracy in the Basic Setting\.

Table 4: LLMBridge subtype classification results on GUMBridge in End\-to\-end and Basic evaluation settings\.

Overall, we see that leveraging Gemini\-3\.1\-pro\-preview as the LLM backend of the pipeline yields the highest scores across all datasets and evaluation settings for LLMBridge\. This is expected as Gemini\-3\.1\-pro\-preview is by far the largest model tested here, though it is notable that the scores are still not high\. We observe an overall increase in scores moving from the results of Qwen2\.5\-7B\-Instruct to those of Llama\-3\.3\-70B\-Instruct to those of Gemini\-3\.1\-pro\-preview\. This demonstrates the expected trend that larger, more powerful chain\-of\-thought \(CoT\) reasoning models give stronger performance on high level discourse tasks like bridging resolution\.

When comparing the different evaluation settings, we expect the End\-to\-end Bridging Resolution Setting to be more challenging than the Basic Bridging Resolution Setting\. We see this reflected in the performance of LLMBridge across the different models and datasets, with scores increasing from the End\-to\-end to the Basic Setting\. Taking the Gemini\-3\.1\-pro\-preview GUMBridge scores as an illustrative example, we can see that the accuracy score for antecedent selection in the Basic Setting is 56\.6, while the F1 scores for anaphor recognition and resolution are 50\.1 and 31\.3 in the End\-to\-end Setting\. The scores for other datasets and models follow the same pattern\.

Comparing the best performance of LLMBridge \(Gemini\-3\.1\-pro\-preview backend\) to the reported scores from previous systems, we see that we achieve SoTA results on all datasets in both the End\-to\-end and Basic evaluation settings\. In Figure [1](https://arxiv.org/html/2605.29048#S6.F1), we provide a comparison of the LLMBridge scores \(Gemini backend\) to previous SoTA results on the referential bridging datasets\. For BASHI, we achieve SoTA results in the Basic Setting with an antecedent selection accuracy of 56\.7 \(\+18\.0 over previous SoTA\) and in the End\-to\-end Setting with an anaphor recognition F1 of 34\.5 \(\+2\.2\) and an anaphor resolution F1 of 21\.1 \(\+2\.8\)\. For ISNotes, we achieve SoTA results in the Basic Setting with an antecedent selection accuracy of 58\.1 \(\+8\.0\), and in the End\-to\-end Setting with an anaphor recognition F1 of 45\.6 \(\+5\.7\) and an anaphor resolution F1 of 29\.8 \(\+3\.6\)\. For GUMBridge we achieve SoTA on antecedent selection accuracy in the Basic Setting at 56\.6 \(\+7\.0\)\. We also provide the first End\-to\-end bridging resolution scores on the GUMBridge dataset\. The anaphor recognition and anaphor resolution F1\-Scores we report for GUMBridge, 50\.1 and 31\.3 respectively, are the highest F1\-Scores ever reported on any English bridging dataset in the End\-to\-end Setting\.

When we compare between the LLMBridge results for Qwen2\.5\-7B\-Instruct and Qwen2\.5\-7B\-Instruct\-FT, we see that there is a sizable increase in performance across evaluation settings and datasets\. In the End\-to\-end Setting we see an avg\. ΔF1 across datasets of \+8\.0 for anaphor recognition and \+7\.8 for anaphor resolution\. In the Basic Setting there is an avg\. Δaccuracy of \+17\.0 for anaphor resolution\. We also see that this jump in performance is large enough to be competitive across datasets with the performance of the larger Llama\-3\.3\-70B\-Instruct in the End\-to\-end Setting\. With regard to subtype classification \(GUMBridge only\), we also see an increase in performance, with a ΔF1 of \+11\.4 in the End\-to\-end Setting and a ΔF1 of \+24\.6 in the Basic Setting\. This increase in performance across subtasks illustrates the utility of fine\-tuning an LLM for the bridging resolution task, even when the training data and compute resources are limited\. If resource availability prevents simply scaling up model size, fine\-tuning a smaller model can still result in meaningful gains\.

![Refer to caption](https://arxiv.org/html/2605.29048v1/comparison2.png)

Figure 1: LLMBridge \(Gemini Backend\) vs\. Previous SoTA Results on Referential Bridging Datasets

Finally, looking at the GUMBridge subtype classification scores in Table [4](https://arxiv.org/html/2605.29048#S6.T4), we see that the performance is considerably higher in the easier Basic Setting than in the harder End\-to\-end Setting where classification is a downstream task\. As GUMBridge allows for multiple subtype annotations on an instance of bridging, we report P/R/F1 on detection of individual subtype labels\. In the Basic Setting, we achieve an F1 score of 78\.5\. For exact match on subtype labels, we achieve an accuracy score of 76\.3 in the Basic Setting\. These are the first subtype classification scores for GUMBridge reporting both P/R/F1 and accuracy, and, overall, such scores indicate moderate success for the task of bridging subtype classification\.

## 7 Error Analysis

In this section, we conduct error analysis on the predictions from the best performing configuration of the LLMBridge pipeline \(Gemini\-3\.1\-pro\-preview backend\)\. We conduct this error analysis in order to determine what sub\-varieties of referential bridging remain difficult for an LLM based bridging resolution system like LLMBridge\. We explore the influence of anaphor\-antecedent pair distance and bridging subtype on the ability of the LLMBridge pipeline to identify instances of bridging anaphora\. We additionally examine LLMBridge’s performance on subtype categorization, looking at which subtypes are easier/harder for the system to classify\. As GUMBridge is the only dataset with subtype labels, the subtype related investigations exclusively analyze the predictions on GUMBridge test\. Details on the GUMBridge subtype categorization schema are given in Appendix [A](https://arxiv.org/html/2605.29048#A1)\.

![Refer to caption](https://arxiv.org/html/2605.29048v1/distance_density_error_type.png)

Figure 2: Distribution of anaphor\-antecedent distances for bridging pairs with error types of False Negative \(red\), False Positive \(green\), and True Positive \(blue\) for LLMBridge \(Gemini backend\) performance on anaphor recognition\.

In Figure [2](https://arxiv.org/html/2605.29048#S7.F2), we show the distribution of anaphor\-antecedent distances for bridging pairs in ISNotes, BASHI, and GUMBridge grouped by whether LLMBridge correctly identified the bridging anaphor \(True Positive; TP\), failed to identify the bridging anaphor \(False Negative; FN\), or mistakenly identified a non\-bridging entity as a bridging anaphor \(False Positive; FP\)\. Because the context window given to LLMBridge for anaphor detection and antecedent selection was only 150 tokens \(\> 90% of bridging instances in the data are < 150 tokens apart\), we first filter out long distance instances of bridging more than 150 tokens apart\. Looking at the density curves in Figure [2](https://arxiv.org/html/2605.29048#S7.F2), we see that a larger part of the density curve for False Negatives covers higher token distances when compared with the curves for True Positives and False Positives\. Kolmogorov\-Smirnov tests comparing the FN distribution with the TP and FP distributions confirm that LLMBridge is more likely to fail to identify instances of bridging when the anaphor and antecedent are further apart \(Holm adjusted p\-values both < 0\.01\)\. In other words, instances of long distance bridging are more difficult for the system to identify\. It is notable that this trend is apparent even when constrained to examining instances of bridging that occur within 150 tokens\. We also note the FP distribution is not significantly different from the TP distribution, indicating that although FP predictions are wrong, they are being made in a reasonable distance range\.
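For readers unfamiliar with the two statistical tools named above, the sketch below implements the two\-sample Kolmogorov\-Smirnov statistic (the largest gap between two empirical CDFs) and the Holm step\-down adjustment in plain Python. It does not compute KS p\-values, which in practice would come from a statistics library, and the distance samples and raw p\-values shown are invented purely for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: max absolute difference between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1); enforce monotonicity.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Invented token distances for illustration only.
fn_distances = [40, 75, 90, 120, 140]
tp_distances = [5, 10, 20, 35, 60]
d_fn_tp = ks_statistic(fn_distances, tp_distances)

# Invented raw p-values for the two comparisons (FN vs. TP, FN vs. FP).
adjusted = holm_adjust([0.001, 0.004])
```

With two comparisons, Holm doubles the smaller p\-value and leaves the larger one unchanged, so both must already be small for the adjusted values to fall below 0\.01 as reported.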

![Refer to caption](https://arxiv.org/html/2605.29048v1/ana_subtype.png)

Figure 3: $\chi^{2}$ residuals: True Positive/False Negative LLMBridge predictions \(Gemini backend\) on anaphor recognition and gold bridging subtype label\. \($\chi^{2}$ = 28\.899, df = 10, p < 0\.01\)

In Figure [3](https://arxiv.org/html/2605.29048#S7.F3), we show an association plot of the residuals from a $\chi^{2}$ test relating the outcome of anaphor recognition for gold bridging instances, correctly identified \(TP\) or missed \(FN\), to the gold subtype labels of those instances\. Looking at the TP row of the association plot, we can see that LLMBridge is better able to identify bridging anaphora labeled with comparison\-relative or entity\-meronomy\. The comparison\-relative instances being easier to identify likely reflects the tendency for this subtype to have overt markers \(e\.g\., comparative markers such as “other” or “another”\)\. entity\-meronomy may be easier to recognize because part\-whole relations are more lexically inferable from the entities alone \(e\.g\., a shady garden bed → the soil\)\. Looking at the FN row, we can see that LLMBridge is more likely to miss bridging instances labeled with entity\-associative, entity\-property, or set\-member\. It is unsurprising that the entity\-associative subtype is difficult, as associative entity relations comprise the broadest sub\-variety, covering a variety of implicit relations which lack overt markers, such as relational nouns \(e\.g\., a business → the customer\), implicit arguments \(e\.g\., a murder → the victim\), and prototypical associations \(e\.g\., a wedding → the reception\)\. For the subtypes entity\-property and set\-member, the relations may also be more abstract \(e\.g\., intangible qualities and class\-instance relations\) or more dependent on a wider context for understanding \(i\.e\., it may be difficult to infer a set relation just from the anaphor itself\)\.
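As a rough illustration of how residuals of this kind are obtained, the sketch below computes the $\chi^{2}$ statistic, the degrees of freedom, and the Pearson residuals for a TP/FN by subtype contingency table. It assumes the association plot is based on Pearson residuals, $(O - E)/\sqrt{E}$, which the text does not state explicitly. The function name and the counts are hypothetical, and every row and column is assumed to have a non\-zero total.

```python
import math

def chi_square_residuals(table):
    """Return (chi2, df, Pearson residuals) for a contingency table.

    `table` is a list of rows (e.g. [TP counts, FN counts]), one column
    per subtype. Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            res = (observed - expected) / math.sqrt(expected)
            chi2 += res ** 2  # chi2 is the sum of squared Pearson residuals
            res_row.append(res)
        residuals.append(res_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, residuals

# Hypothetical counts for three subtypes (NOT the paper's data).
subtypes = ["comparison-relative", "entity-meronomy", "entity-associative"]
table = [
    [30, 25, 20],  # TP
    [5, 8, 30],    # FN
]
chi2, df, residuals = chi_square_residuals(table)
for name, tp_res, fn_res in zip(subtypes, residuals[0], residuals[1]):
    print(f"{name}: TP residual {tp_res:+.2f}, FN residual {fn_res:+.2f}")
```

A positive residual in the FN row marks a subtype that is missed more often than independence would predict. With two outcome rows and the eleven GUMBridge subtype columns, the same formula gives the df = 10 reported in the caption.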

![Refer to caption](https://arxiv.org/html/2605.29048v1/cm_gemini_trimmed.png)

Figure 4: Confusion matrix of LLMBridge \(Gemini backend\) predicted subtype labels and gold labels\.

In Figure [4](https://arxiv.org/html/2605.29048#S7.F4), we provide a confusion matrix of the bridging subtype labels assigned by LLMBridge \(Gemini\-3\.1\-pro\-preview backend\) and the gold labels from the test set of GUMBridge\. In order to have more data to analyze, we consider the predictions from the Basic evaluation setting, where the gold bridging pair is provided\. Looking at Figure [4](https://arxiv.org/html/2605.29048#S7.F4), we see that while there is overlap for a variety of different subtypes, most confusions occur only 5 or fewer times\. If we look at the none row of the gold labels axis, we can see the counts of instances where LLMBridge extraneously generated an additional label\. These hallucinated labels are primarily comparison relations\. While it is somewhat common for comparison labels to co\-occur with other subtype labels \(e\.g\., a book → another one being an instance of both comparison\-relative and comparison\-sense\), it appears that LLMBridge over\-generates such cases when only a single label is actually applicable\. Figure [4](https://arxiv.org/html/2605.29048#S7.F4) also shows that the subtypes with the greatest overlap are entity\-resultative and entity\-associative, with 23 instances of resultative being mistaken by the system for instances of associative\. The resultative subtype is narrow in scope, focusing specifically on causal/transformational relations between entities, which are frequently seen in contexts involving product\-producing processes, such as cooking/baking \(e\.g\., some flour → the bread, Fang et al\. [2022](https://arxiv.org/html/2605.29048#bib.bib49)\), but not common in most contexts\. The fact that instances of this subtype are uniformly predicted as the broader associative subtype indicates that resultative likely needs to be further detailed with more examples in the subtype classification prompt\.
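Since an instance may carry more than one subtype label, a confusion matrix of this kind needs some convention for aligning gold and predicted label sets. The sketch below shows one possible convention, which is an assumption and not necessarily the one used for Figure 4: shared labels count on the diagonal, leftover gold and predicted labels are paired as confusions, and any unpaired label is counted against a `none` placeholder. The function name and example data are hypothetical.

```python
from collections import Counter

def confusion_counts(gold_sets, pred_sets):
    """Count (gold_label, predicted_label) pairs for multi-label instances."""
    counts = Counter()
    for gold, pred in zip(gold_sets, pred_sets):
        gold, pred = set(gold), set(pred)
        # Labels present on both sides are correct predictions.
        for label in gold & pred:
            counts[(label, label)] += 1
        missed = sorted(gold - pred)  # gold labels the system did not output
        extra = sorted(pred - gold)   # predicted labels absent from gold
        # Pair leftovers as confusions (alphabetical pairing is arbitrary).
        for g, p in zip(missed, extra):
            counts[(g, p)] += 1
        # Anything still unpaired is counted against the "none" placeholder.
        for g in missed[len(extra):]:
            counts[(g, "none")] += 1
        for p in extra[len(missed):]:
            counts[("none", p)] += 1
    return counts

# Hypothetical instances (NOT the paper's data).
gold = [{"entity-resultative"}, {"comparison-relative"}]
pred = [{"entity-associative"}, {"comparison-relative", "comparison-sense"}]
for (g, p), n in sorted(confusion_counts(gold, pred).items()):
    print(f"gold={g:22s} pred={p:22s} count={n}")
```

Under this convention, an extra `comparison-sense` prediction on an instance whose only gold label is `comparison-relative` lands in the `none` row of the gold axis, matching the over\-generation pattern described above.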

## 8Conclusion

In this paper, we present LLMBridge, the first LLM based pipeline for bridging resolution\. We evaluate our bridging resolution system in the challenging End\-to\-end evaluation setting, along with the Basic evaluation setting\. We report scores on all 3 of the referential bridging evaluation datasets for English: ISNotes, BASHI, and GUMBridge\. We additionally provide the first set of scores in the End\-to\-end evaluation setting for the GUMBridge corpus on bridging resolution and bridging subtype classification\. We achieve new SoTA results on all 3 datasets in both the End\-to\-end and Basic evaluation settings\. We also provide error analysis of LLMBridge’s performance on the evaluation data, finding that the system struggles to identify long distance instances of bridging as well as more abstract bridging subtypes, like entity\-associative\. Overall, LLMBridge produces strong results on all 3 of the referential bridging corpora for English, including the highest scores reported to date on any English dataset for anaphor recognition and anaphor resolution in the End\-to\-end setting\.

## Limitations

This paper focuses on creating a bridging resolution pipeline for English and does not explore the cross\-linguistic differences in how bridging manifests\. Future work is required to determine to what extent this pipeline could directly apply or be adapted for bridging in a multilingual setting\. Additionally, the design of this pipeline focuses specifically on the anaphoric, information status based definition of bridging known as referential bridging\. Further adaptation of the pipeline would be necessary to increase performance on lexical bridging\. Also, results from previous work are reported rather than reproduced due to incomplete or unreproducible software repositories\. This means we were not able to obtain p\-values for differences between our results and previous numbers\.

We also acknowledge that running the LLMBridge pipeline with a SoTA commercial LLM backend \(such as Gemini\-3\.1\-pro\-preview\) is a significant expense, though past experience suggests that a model that is expensive today is likely to drop in cost rapidly\. The cost of the evaluation conducted in this paper was approximately 500 USD, nearly all of which was for the results obtained from running LLMBridge with Gemini\-3\.1\-pro\-preview, which is a very new and expensive model\. These costs limit both our ability to conduct multiple runs for evaluation and the ability of future work to reproduce our results, at least until a corresponding model becomes more affordable\. While further exploration is required to find a balancing point between backend model cost and pipeline performance, results from this study indicate that LoRA fine\-tuning a mid\-sized LLM may be a promising direction, even though the scores we report reflect the advantages of the largest available model at the time of writing\.

## References

- L\. Bu, L\. Levine, and A\. Zeldes \(2025\)DiscoTrack: a multilingual llm benchmark for discourse tracking\.arXiv preprint arXiv:2510\.17013\.Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p5.1)\.
- M\. de Marneffe, C\. D\. Manning, J\. Nivre, and D\. Zeman \(2021\)Universal dependencies\.Computational Linguistics47\(2\),pp\. 255–308\.External Links:ISSN 0891\-2017,[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00402),[Link](https://doi.org/10.1162/coli_a_00402)Cited by:[§5\.3](https://arxiv.org/html/2605.29048#S5.SS3.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4171–4186\.External Links:[Link](https://aclanthology.org/N19-1423/),[Document](https://dx.doi.org/10.18653/v1/N19-1423)Cited by:[§4](https://arxiv.org/html/2605.29048#S4.p1.1)\.
- B\. Fang, T\. Baldwin, and K\. Verspoor \(2022\)What does it take to bake a cake? the RecipeRef corpus and anaphora resolution in procedural text\.InFindings of the Association for Computational Linguistics: ACL 2022,S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 3481–3495\.External Links:[Link](https://aclanthology.org/2022.findings-acl.275/),[Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.275)Cited by:[§7](https://arxiv.org/html/2605.29048#S7.p4.1)\.
- Y\. Hou, K\. Markert, and M\. Strube \(2014\)A rule\-based system for unrestricted bridging resolution: recognizing bridging anaphora and finding links to antecedents\.InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),A\. Moschitti, B\. Pang, and W\. Daelemans \(Eds\.\),Doha, Qatar,pp\. 2082–2093\.External Links:[Link](https://aclanthology.org/D14-1222/),[Document](https://dx.doi.org/10.3115/v1/D14-1222)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1)\.
- Y\. Hou \(2020\)Bridging anaphora resolution as question answering\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 1428–1438\.External Links:[Link](https://aclanthology.org/2020.acl-main.132/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.132)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p5.1),[§4](https://arxiv.org/html/2605.29048#S4.p3.1),[Table 3](https://arxiv.org/html/2605.29048#S6.T3.1.1.8.7.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)LoRA: low\-rank adaptation of large language models\.ICLR1\(2\),pp\. 3\.Cited by:[§5\.2](https://arxiv.org/html/2605.29048#S5.SS2.p1.1)\.
- M\. Joshi, D\. Chen, Y\. Liu, D\. S\. Weld, L\. Zettlemoyer, and O\. Levy \(2020\)SpanBERT: improving pre\-training by representing and predicting spans\.Transactions of the Association for Computational Linguistics8,pp\. 64–77\.External Links:[Link](https://aclanthology.org/2020.tacl-1.5/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00300)Cited by:[§4](https://arxiv.org/html/2605.29048#S4.p2.1)\.
- S\. Khosla, J\. Yu, R\. Manuvinakurike, V\. Ng, M\. Poesio, M\. Strube, and C\. Rosé \(2021\)The CODI\-CRAC 2021 shared task on anaphora, bridging, and discourse deixis in dialogue\.InProceedings of the CODI\-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue,S\. Khosla, R\. Manuvinakurike, V\. Ng, M\. Poesio, M\. Strube, and C\. Rosé \(Eds\.\),Punta Cana, Dominican Republic,pp\. 1–15\.External Links:[Link](https://aclanthology.org/2021.codi-sharedtask.1/),[Document](https://dx.doi.org/10.18653/v1/2021.codi-sharedtask.1)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1)\.
- H\. Kobayashi, Y\. Hou, and V\. Ng \(2022a\)Constrained multi\-task learning for bridging resolution\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 759–770\.External Links:[Link](https://aclanthology.org/2022.acl-long.56/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.56)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1)\.
- H\. Kobayashi, Y\. Hou, and V\. Ng \(2022b\)End\-to\-end neural bridging resolution\.InProceedings of the 29th International Conference on Computational Linguistics,N\. Calzolari, C\. Huang, H\. Kim, J\. Pustejovsky, L\. Wanner, K\. Choi, P\. Ryu, H\. Chen, L\. Donatelli, H\. Ji, S\. Kurohashi, P\. Paggio, N\. Xue, S\. Kim, Y\. Hahm, Z\. He, T\. K\. Lee, E\. Santus, F\. Bond, and S\. Na \(Eds\.\),Gyeongju, Republic of Korea,pp\. 766–778\.External Links:[Link](https://aclanthology.org/2022.coling-1.64/)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1),[§4](https://arxiv.org/html/2605.29048#S4.p2.1)\.
- H\. Kobayashi, Y\. Hou, and V\. Ng \(2023\)PairSpanBERT: an enhanced language model for bridging resolution\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 6931–6946\.External Links:[Link](https://aclanthology.org/2023.acl-long.383/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.383)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1),[§4](https://arxiv.org/html/2605.29048#S4.p2.1),[Table 2](https://arxiv.org/html/2605.29048#S6.T2.1.1.14.13.1),[Table 2](https://arxiv.org/html/2605.29048#S6.T2.1.1.20.19.1)\.
- H\. Kobayashi and V\. Ng \(2020\)Bridging resolution: a survey of the state of the art\.InProceedings of the 28th International Conference on Computational Linguistics,D\. Scott, N\. Bel, and C\. Zong \(Eds\.\),Barcelona, Spain \(Online\),pp\. 3708–3721\.External Links:[Link](https://aclanthology.org/2020.coling-main.331/),[Document](https://dx.doi.org/10.18653/v1/2020.coling-main.331)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1)\.
- H\. Kobayashi and V\. Ng \(2021\)Bridging resolution: making sense of the state of the art\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),Online,pp\. 1652–1659\.External Links:[Link](https://aclanthology.org/2021.naacl-main.131/),[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.131)Cited by:[§4](https://arxiv.org/html/2605.29048#S4.p2.1)\.
- L\. Levine and A\. Zeldes \(2025\)Subjectivity in the annotation of bridging anaphora\.InProceedings of the 19th Linguistic Annotation Workshop \(LAW\-XIX\-2025\),S\. Peng and I\. Rehbein \(Eds\.\),Vienna, Austria,pp\. 48–59\.External Links:[Link](https://aclanthology.org/2025.law-1.4/),[Document](https://dx.doi.org/10.18653/v1/2025.law-1.4),ISBN 979\-8\-89176\-262\-6Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p5.1)\.
- L\. Levine and A\. Zeldes \(2026\)GUMBridge: A corpus for varieties of bridging anaphora\.InProceedings of the Fifteenth Language Resources and Evaluation Conference \(LREC 2026\),S\. Piperidis, N\. Bel, H\. van den Heuvel, N\. Ide, S\. Krek, and A\. Toral \(Eds\.\),Palma, Mallorca, Spain,pp\. 6823–6837\.External Links:[Document](https://dx.doi.org/10.63317/3sf73k63vuww)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p5.1),[§1](https://arxiv.org/html/2605.29048#S1.p6.1),[Table 3](https://arxiv.org/html/2605.29048#S6.T3.1.1.7.6.1)\.
- K\. Markert, Y\. Hou, and M\. Strube \(2012\)Collective classification for fine\-grained information status\.InProceedings of the 50th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),H\. Li, C\. Lin, M\. Osborne, G\. G\. Lee, and J\. C\. Park \(Eds\.\),Jeju Island, Korea,pp\. 795–804\.External Links:[Link](https://aclanthology.org/P12-1084/)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p6.1)\.
- M\. Poesio and R\. Artstein \(2008\)Anaphoric annotation in the ARRAU corpus\.InProceedings of the Sixth International Conference on Language Resources and Evaluation \(LREC’08\),N\. Calzolari, K\. Choukri, B\. Maegaard, J\. Mariani, J\. Odijk, S\. Piperidis, and D\. Tapias \(Eds\.\),Marrakech, Morocco\.External Links:[Link](https://aclanthology.org/L08-1091/)Cited by:[footnote 3](https://arxiv.org/html/2605.29048#footnote3)\.
- P\. Qi, Y\. Zhang, Y\. Zhang, J\. Bolton, and C\. D\. Manning \(2020\)Stanza: a python natural language processing toolkit for many human languages\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations,A\. Celikyilmaz and T\. Wen \(Eds\.\),Online,pp\. 101–108\.External Links:[Link](https://aclanthology.org/2020.acl-demos.14/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-demos.14)Cited by:[§5\.3](https://arxiv.org/html/2605.29048#S5.SS3.p1.1)\.
- I\. Roesiger, A\. Riester, and J\. Kuhn \(2018\)Bridging resolution: task definition, corpus resources and rule\-based experiments\.InProceedings of the 27th International Conference on Computational Linguistics,E\. M\. Bender, L\. Derczynski, and P\. Isabelle \(Eds\.\),Santa Fe, New Mexico, USA,pp\. 3516–3528\.External Links:[Link](https://aclanthology.org/C18-1298/)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1),[§2](https://arxiv.org/html/2605.29048#S2.p1.1),[§4](https://arxiv.org/html/2605.29048#S4.p2.1)\.
- I\. Rösiger \(2018\)BASHI: a corpus of Wall Street Journal articles annotated with bridging links\.InProceedings of the Eleventh International Conference on Language Resources and Evaluation \(LREC 2018\),N\. Calzolari, K\. Choukri, C\. Cieri, T\. Declerck, S\. Goggi, K\. Hasida, H\. Isahara, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, S\. Piperidis, and T\. Tokunaga \(Eds\.\),Miyazaki, Japan\.External Links:[Link](https://aclanthology.org/L18-1058/)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p6.1)\.
- O\. Uryupina, R\. Artstein, A\. Bristot, F\. Cavicchio, F\. Delogu, K\. J\. Rodríguez, and M\. Poesio \(2019\)Annotating a broad range of anaphoric phenomena, in a variety of genres: the arrau corpus\.Natural Language Engineering26,pp\. 95 – 128\.External Links:[Link](https://api.semanticscholar.org/CorpusID:164858637)Cited by:[footnote 3](https://arxiv.org/html/2605.29048#footnote3)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§4](https://arxiv.org/html/2605.29048#S4.p1.1)\.
- R\. Weischedel, S\. Pradhan, L\. Ramshaw, M\. Palmer, N\. Xue, M\. Marcus, A\. Taylor, C\. Greenberg, E\. Hovy, R\. Belvin,et al\.\(2011\)OntoNotes release 4\.0\.LDC2011T03, Philadelphia, Penn\.: Linguistic Data Consortium17\.Cited by:[§3\.2](https://arxiv.org/html/2605.29048#S3.SS2.p3.1)\.
- J\. Yu, S\. Khosla, R\. Manuvinakurike, L\. Levin, V\. Ng, M\. Poesio, M\. Strube, and C\. Rosé \(2022\)The CODI\-CRAC 2022 shared task on anaphora, bridging, and discourse deixis in dialogue\.InProceedings of the CODI\-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue,J\. Yu, S\. Khosla, R\. Manuvinakurike, L\. Levin, V\. Ng, M\. Poesio, M\. Strube, and C\. Rose \(Eds\.\),Gyeongju, Republic of Korea,pp\. 1–14\.External Links:[Link](https://aclanthology.org/2022.codi-crac.1/)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1)\.
- J\. Yu and M\. Poesio \(2020\)Multitask learning\-based neural bridging reference resolution\.InProceedings of the 28th International Conference on Computational Linguistics,D\. Scott, N\. Bel, and C\. Zong \(Eds\.\),Barcelona, Spain \(Online\),pp\. 3534–3546\.External Links:[Link](https://aclanthology.org/2020.coling-main.315/),[Document](https://dx.doi.org/10.18653/v1/2020.coling-main.315)Cited by:[§1](https://arxiv.org/html/2605.29048#S1.p4.1),[§4](https://arxiv.org/html/2605.29048#S4.p2.1)\.
- A\. Zeldes \(2017\)The GUM corpus: creating multilayer resources in the classroom\.Language Resources and Evaluation51\(3\),pp\. 581–612\.External Links:[Document](https://dx.doi.org/10.1007/s10579-016-9343-x)Cited by:[§3\.2](https://arxiv.org/html/2605.29048#S3.SS2.p3.1)\.

## Appendix AGUMBridge Bridging Subtypes

This appendix briefly details the bridging subtype varieties annotated in the GUMBridge corpus\. Figure[5](https://arxiv.org/html/2605.29048#A1.F5)shows the proportions and raw counts of the subtype annotations in the GUMBridge corpus\.

![Refer to caption](https://arxiv.org/html/2605.29048v1/subtype_pie.png)

Figure 5: Bridging subtype proportions in the GUMBridge corpus\. Raw counts for each subtype annotation are shown in the figure\.

#### comparison\-relative

The anaphor is preceded by a comparative marker which implies a comparison to the antecedent \(e\.g\.,several women→other women\)\.

#### comparison\-sense

The type of the anaphor is omitted but inferable via comparison to the antecedent \(e\.g\.,a Chinese restaurant→the Italian one\)\.

#### comparison\-time

The anaphor refers to a specific time/time frame which is understandable with reference to the time/time frame expressed by the antecedent \(e\.g\.,Wednesday→yesterday\)\.

#### entity\-meronomy

The anaphor has a part\-whole relation with the antecedent, including physical subparts, substance\-portion, and regions/subsections \(e\.g\.,a house→the door\)\.

#### entity\-property

The anaphor is a physical or intangible property of the antecedent, such as smell, length, or style \(e\.g\.,a bouquet of roses→the scent\)\.

#### entity\-resultative

The anaphor is logically inferable from the antecedent\. This is often the result of a transformative/product producing process, like cooking/baking \(e\.g\.,flour→the bread\)\.

#### entity\-associative

The anaphor is an attribute or closely associated entity of the antecedent \(e\.g\.,a library→the books\)\.

#### set\-member

The anaphor is an element of the antecedent set\. This includes group\-member and class\-instance relations \(e\.g\.,several books→the mystery novel\)\.

#### set\-subset

The anaphor is a subset of the antecedent set \(e\.g\.,a group of students→the boys\)\.

#### set\-span\-interval

The anaphor is a sub\-span of the spatial or temporal antecedent interval \(e\.g\., Sunday → the morning\)\.

#### other

The other category is for instances which fit the information status based definition of a bridging pair but do not fall into any of the bridging subtype categories outlined above\.

## Appendix BLLMBridge Pipeline Prompts

The following are examples of the prompt templates we use to query the LLM backend of LLMBridge for each bridging resolution subtask\. To develop our LLM prompts for each bridging subtask, we utilized a broad definition of referential bridging that prioritizes high recall across datasets\. While we explored using more specific prompts specialized for each dataset's definition of bridging, the resulting gains in precision did not outweigh the loss in recall \(see footnote 9\)\. As such, our finalized prompts use the more open definition of referential bridging combined with subtype explanations from the GUMBridge categorization schema and dataset specific few\-shot examples for the individual subtasks\.

Footnote 9: During prompt development, we evaluated against a subset of documents from the GUMBridge dev set \(GUM\_academic\_librarians, GUM\_conversation\_grounded, GUM\_court\_loan, GUM\_bio\_emperor, GUM\_voyage\_athens\), as well as dev documents we set aside from ISNotes \(wsj\_1450, wsj\_1327, wsj\_1232, wsj\_1163, wsj\_1148\) and BASHI \(wsj\_1846, wsj\_2112, wsj\_1505, wsj\_0242, wsj\_0790\)\.

#### Anaphor Recognition

You are a linguistic assistant for identifying bridging anaphora\. A bridging anaphor is a newly introduced entity whose referent in the context is inferable based on a relationship to a previous entity or verbal predicate in the discourse \(the antecedent\)\. For instance, "a house … the door" \(antecedent=a house, anaphor=the door\) where we understand that the door is specifically the door of the house, and we cannot interpret the referent of the door without referring to the house\.

There are a few types of such anaphora:

· Comparison

- comparison\-relative: The anaphor is preceded by a comparative marker \(other, another, same, more, ordinal modifiers, comparative adjectives, superlatives, etc\.\) which implies a comparison to the antecedent\. For example: "The children … another child" \(=another with comparison to the aforementioned children\); similar cases may be similar children, older children \(compared to the aforementioned children\), etc\.
- comparison\-sense: the semantic type of a phrase requires a previous mention to identify it, for example "the Italian restaurant … a Chinese one" \(we can’t know "a Chinese one" is a restaurant without referring back to the Italian restaurant\), or "another one", "the others" etc\.
- comparison\-time: the anaphor refers to a specific time/timeframe which is understandable with reference to the antecedent, for example: "Tuesday, February 2nd … the following week"

· Entity

- entity\-meronomy: the anaphor is a subunit of the antecedent \(part\-whole\), including physical subunits, portion\-substance relations, and regions/subsections\. For example: "the house … the door" \(=of the house\)\.
- entity\-associative: the anaphor is an attribute or closely associated entity of the antecedent, including both prototypical and inducible associations: "a wedding … the bride" \(=the bride at that wedding\), implicit arguments of a predicate or a verbal nominalization: "a play… the performance" \(=of the play\), relational nouns: "a murder … the victim"
- entity\-property: the anaphor is a physical or intangible property of the antecedent \(e\.g\., smell, length, size, style, etc\.\): "the tea … the sweet aroma"
- entity\-resultative: the anaphor is logically inferable from the antecedent \(e\.g\., result, transformation/transmutation, cause\): "the dough … the bread" \(=the dough becomes bread after baking\)

· Set

- set\-member: the anaphor is an element of the antecedent set, including groups\-member relations and classes\-instances: "the cars … the Mazda", additionally indefinite members to definite sets: "a candle on each cupcake … the candles"
- set\-subset: the anaphor is a subset of the antecedent set: "the cars … the Mazdas" \(not all Mazdas, just the subset among the aforementioned cars\)
- set\-span\-interval: the anaphor is a sub\-span of a spatial or temporal interval defined by the antecedent: "last week … Wednesday" \(=Wednesday of last week\), "Sunday … the morning" \(=the morning portion of that Sunday\)

TASK DEFINITION

Given a larger passage and a specified subspan labeled “Text: …”, return a list of all bridging anaphora \(newly introduced entities whose referent is inferable based on a relationship to a previous entity or verbal predicate\)\. Output must be the exact surface strings from the subspan\. If none, return \[\]\.

OUTPUT CONSTRAINTS \(STRICT\)

- Return a JSON\-style list of exact strings from the candidate list\.
- If no candidates are bridging, return: \[\]
- Do not add explanations or any extra text\.

CRITICAL REMINDERS

- The bridging anaphor must be a newly introduced entity \(it cannot corefer with anything the occurs before it in the text\)
- The interpretation of the bridging anaphor MUST depend on a PREVIOUS entity or verbal predicate for interpretation
- Return bridging anaphora as contiguous mention span exactly as written in the candidate list \(including unusual spacing, hyphenation, parentheses tokens like \-LRB\-/\-RRB\-, and any trailing comma that is part of the noun phrase\)\.
- Entities should be considered with their full phrases – if a noun is expanded by modifier clauses etc\., include the entire noun phrase \(maximal projection\), e\.g\., in "the man who I saw", not just "the man"

TASK EXAMPLES

\{dataset\_specific\_examples\}

TASK

Consider the following text:

\{context\_text\}

Please return a list of all bridging anaphora in the following subspan of the text\.

Text:

\{text\}

Answer\(s\):
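To make the use of this template concrete, the following is a minimal sketch of filling the three placeholders \(`dataset_specific_examples`, `context_text`, `text`\) and parsing a JSON\-style list response. It does not reproduce LLMBridge's released code: the helper names are hypothetical, no LLM call is made, and the filtering step \(keeping only strings found verbatim in the subspan\) is an assumed post\-processing heuristic suggested by the prompt's "exact surface strings" constraint.

```python
import json

# Tail of the prompt template; the long instruction block is passed in separately.
TEMPLATE_TAIL = (
    "TASK EXAMPLES\n{dataset_specific_examples}\n"
    "TASK\nConsider the following text:\n{context_text}\n"
    "Please return a list of all bridging anaphora in the following "
    "subspan of the text.\nText:\n{text}\nAnswer(s):"
)

def build_prompt(instructions, examples, context_text, text):
    """Fill the template placeholders (hypothetical helper)."""
    return instructions + "\n" + TEMPLATE_TAIL.format(
        dataset_specific_examples=examples,
        context_text=context_text,
        text=text,
    )

def parse_anaphors(response, subspan):
    """Parse a JSON-style list and keep strings that occur verbatim in the subspan."""
    try:
        parsed = json.loads(response.strip())
    except json.JSONDecodeError:
        return []  # malformed output is treated as "no anaphora" in this sketch
    if not isinstance(parsed, list):
        return []
    return [s for s in parsed if isinstance(s, str) and s in subspan]

context = "We visited a house yesterday . The door was painted red ."
subspan = "The door was painted red ."
prompt = build_prompt("You are a linguistic assistant ...", "(examples)", context, subspan)
# A made-up model response, used only to exercise the parser.
print(parse_anaphors('["The door", "the roof"]', subspan))
```

Dropping strings that do not occur in the subspan guards against paraphrased or invented mentions, at the cost of discarding answers that differ from the source only in spacing.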

#### Anaphor Resolution

You are a linguistic assistant tasked with finding the antecedent of a bridging anaphor\. A bridging anaphor is a newly introduced entity whose referent in the context is inferable based on a relationship to a previous entity or verbal predicate in the discourse \(the antecedent\)\. For instance, "a house … the door" \(antecedent=a house, anaphor=the door\) where we understand that the door is specifically the door of the house, and we cannot interpret the referent of the door without referring to the house\. The antecedent is the entity of verabl predicate that allows the reader to understand the referent of the bridging anaphor\.

There are a few types of such anaphora with their antecedents:

· Comparison

- comparison\-relative: The anaphor is preceded by a comparative marker \(other, another, same, more, ordinal modifiers, comparative adjectives, superlatives, etc\.\) which implies a comparison to the antecedent\. For example: "The children … another child" \(=another with comparison to the aforementioned children, antecedent=The children\); similar cases may be similar children, older children \(compared to the aforementioned children\), etc\.
- comparison\-sense: the semantic type of a phrase requires a previous mention to identify it, for example "the Italian restaurant … a Chinese one" \(we can’t know "a Chinese one" is a restaurant without referring back to the Italian restaurant, antecedent=the Italian restaurant\), or "another one", "the others" etc\.
- comparison\-time: the anaphor refers to a specific time/timeframe which is understandable with reference to the antecedent, for example: "Tuesday, February 2nd … the following week" \(antecedent=Tuesday, February 2nd\)

· Entity

- entity\-meronomy: the anaphor is a subunit of the antecedent \(part\-whole\), including physical subunits, portion\-substance relations, and regions/subsections\. For example: "the house … the door" \(=of the house, antecedent=the house\)\.
- entity\-associative: the anaphor is an attribute or closely associated entity of the antecedent, including both prototypical and inducible associations: "a wedding … the bride" \(=the bride at that wedding, antecedent=a wedding\), implicit arguments of a predicate or a verbal nominalization: "a play… the performance" \(=of the play, antecedent=the play\), relational nouns: "a murder … the victim" \(antecedent=a murder\)
- entity\-property: the anaphor is a physical or intangible property of the antecedent \(e\.g\., smell, length, size, style, etc\.\): "the tea … the sweet aroma" \(antecedent=the tea\)
- entity\-resultative: the anaphor is logically inferable from the antecedent \(e\.g\., result, transformation/transmutation, cause\): "the dough … the bread" \(=the dough becomes bread after baking, antecedent=the dough\)

· Set

- set\-member: the anaphor is an element of the antecedent set, including groups\-member relations and classes\-instances: "the cars … the Mazda" \(antecedent=the cars\), additionally indefinite members to definite sets: "a candle on each cupcake … the candles" \(antecedent=a candle on each cupcake\)
- set\-subset: the anaphor is a subset of the antecedent set: "the cars … the Mazdas" \(not all Mazdas, just the subset among the aforementioned cars, antecedent=the cars\)
- set\-span\-interval: the anaphor is a sub\-span of a spatial or temporal interval defined by the antecedent: "last week … Wednesday" \(=Wednesday of last week, antecedent=last week\), "Sunday … the morning" \(=the morning portion of that Sunday, antecedent=Sunday\)

TASK DEFINITION

You will be given a text with a possible anaphor marked in double curly brackets: \{\{ \}\}\. Output exactly the string of the antecedent \(if there is one\), or ’no antecedent’ \(if there is no antecedent in the given text\)\. If there are multiple mentions of the antecedent, select the one closest to \(but still before\) the anaphor, even if it is a pronoun\. For example in: "the house … it … \{\{the door\}\}", if "it" refers to the house, the correct solution is: it

OUTPUT CONSTRAINTS \(STRICT\)

- Return exactly one string: the antecedent mention copied verbatim from the text\.
- The string must be a single, contiguous span that:
  - Appears before the marked anaphor
  - Is not identical to the anaphor and is not coreferential with it
- Do not add explanations, quotes, brackets, or multiple spans\.
- If no associative antecedent exists in the text \(rare\), return exactly: no antecedent

CRITICAL REMINDERS

- The antecedent must precede the anaphor\.
- Return a single contiguous mention span exactly as written \(including unusual spacing, hyphenation, parentheses tokens like \-LRB\-/\-RRB\-, and any trailing comma that is part of the noun phrase\)\.
- Return full phrases – if a noun is expanded by modifier clauses etc\., include the entire noun phrase \(maximal projection\)
- Do not return a paraphrase or combine multiple spans\.
- If no specific prior mention is required to interpret the anaphor, output: ’no antecedent’

TASK EXAMPLES

\{dataset\_specific\_examples\}

TASK

Please return a single string for the associative antecedent of the bridging anaphor surrounded by double curly brackets: \{\{ \}\}\.

Text:

\{text\}

Answer:
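The input and output conventions of this prompt can be sketched as follows: the anaphor is wrapped in double curly brackets within a bounded context window, and the returned string is accepted only if it appears before the marked anaphor. This is an illustrative sketch with hypothetical helper names. The 150\-token window follows the figure mentioned in the error analysis, but restricting the window to preceding tokens and the exact validation rules are assumptions rather than documented pipeline behaviour.

```python
def mark_anaphor(tokens, start, end, window=150):
    """Return up to `window` preceding tokens followed by the {{anaphor}}."""
    left = max(0, start - window)
    context = " ".join(tokens[left:start])
    anaphor = " ".join(tokens[start:end])
    marked = "{{" + anaphor + "}}"
    return (context + " " + marked) if context else marked

def validate_antecedent(answer, marked_text):
    """Accept the answer only if it is a verbatim span before the anaphor."""
    answer = answer.strip().strip("'\"’")
    if not answer or answer == "no antecedent":
        return None
    preceding, _, rest = marked_text.partition("{{")
    anaphor = rest.split("}}", 1)[0]
    if answer != anaphor and answer in preceding:
        return answer
    return None  # paraphrases and spans outside the context are rejected

tokens = "We visited a house yesterday . the door was red".split()
marked = mark_anaphor(tokens, 6, 8)
print(marked)
# Made-up model answers, used only to exercise the validator.
print(validate_antecedent("a house", marked))
print(validate_antecedent("the roof", marked))
```

Rejecting answers that are not verbatim spans of the preceding context mirrors the prompt's requirement that the antecedent be copied exactly and precede the anaphor.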

#### Subtype Classification

TASK DEFINITION

Classify the subtype\(s\) of bridging relation between one marked anaphor and its corresponding marked antecedent\.

INPUT FORMAT

- You are given a text containing one bridging anaphor–antecedent pair
  - the anaphor, marked with double curly brackets: \{\{ \}\}
  - the antecedent, marked with asterisks: \* \*
- The relation to classify is specifically between the starred antecedent mention and the double\-braced anaphor, using the surrounding context as needed\.

WHAT COUNTS AS A BRIDGING ANAPHOR

- A newly introduced noun phrase \(NP\) whose interpretation depends on a previously mentioned but non\-identical entity or verbal predicate \(the antecedent\)\.
- The anaphor does not corefer with the antecedent\.

DECISION PROCEDURE \(IN ORDER\)

1) Identify the semantic relation\(s\) needed to interpret the anaphor given the antecedent\.
2) Assign all applicable subtypes that are directly licensed by the text and conventional world knowledge \(e\.g\., typical roles/participants, frames\)\.
   - Multiple subtypes may apply; include all that fit\.
3) If none of the defined subtypes apply, output "other"\.

SUBTYPE DEFINITIONS \(WITH DISAMBIGUATION RULES\)

Comparison\-based

- comparison\-relative: The anaphor is introduced with a comparative/superlative/ordinal or related marker \(e\.g\., another, other, same, more, most, less, fewer, similar, different, next, last, first, second, better, best, worse, worst\)\.
  - Includes temporal NPs like “the last time,” “the next day” when they are picked out relative to a previously mentioned event/situation\.
- comparison\-sense: The anaphor’s type/kind is recovered from the antecedent mention \(e\.g\., “one/ones/others” whose category is supplied by the antecedent; or an NP whose category is understood from the prior mention\)\.
  - Often co\-occurs with comparison\-relative when both a comparative marker and type\-recovery are present \(e\.g\., “the others” relative to a prior set; “the last time” where “time of \[that event\]” is understood from context\)\.
- comparison\-time: Use only for temporal anchoring across intervals \(e\.g\., a shift from a previously mentioned specific time span to a following/preceding larger or adjacent interval\)\. Example: “Tuesday” → “the following week”\.
  - Do NOT use for ordinal/superlative temporal NPs like “the last time” anchored to a prior event; prefer comparison\-relative \(and add comparison\-sense if the event/time type comes from the antecedent\)\.

Entity\-based

- entity\-meronomy: Part–whole relations \(physical subparts, regions/subsections, portion–substance\)\.
  - Examples: “the house” → “the door”; “the cake” → “a slice”\.
- entity\-associative: Prototypical associations, roles, attributes, frames, or implicit arguments tied to the antecedent entity or event\.
  - Examples: “a wedding” → “the bride”; “a play” → “the performance”; a person → their typical activities like “work” or “sleep”\.
- entity\-property: A property or attribute of the antecedent \(physical or intangible\), e\.g\., smell, size, style, mood\.
  - Example: “the tea” → “the sweet aroma”\.
- entity\-resultative: Result/cause or transformation relations \(inputs → outputs or vice versa\)\.
  - Example: “the flour” → “the bread”\.

Set\-based

- set\-member: Group–member or class–instance relations\.
  - Examples: “the cars” → “the Mazda”; “mammals” → “a whale”\.
- set\-subset: A subset picked from a previously mentioned set\.
  - Example: “the cars” → “the Mazdas”\.
- set\-span\-interval: A sub\-span within a previously defined spatial or temporal interval\.
  - Example: “last week” → “Wednesday”\.
  - Include this with comparison\-time when the anaphor is a clear sub\-part of an explicit interval named by the antecedent\.

Other

- Use “other” only when none of the above categories fit but interpretation still depends on the antecedent\.

KEY DISAMBIGUATION GUIDELINES

- Comparison vs\. time:
  - Ordinal/comparative temporal expressions like “the last time,” “the next day” → comparison\-relative and comparison\-time\. If their type \(“time of that event”\) comes from the prior mention, also add comparison\-sense\.
  - Cross\-interval anchoring like “Tuesday” → “the following week” → comparison\-relative and comparison\-time\.
- Type\-recovery:
  - “one/ones/others” whose category is supplied by the antecedent → comparison\-sense \(often also comparison\-relative if a comparative marker is present\)\.
- Prototypical associations:
  - Daily routines/activities linked to a person or role \(e\.g\., “A contemporary American” → “work”; “you” → “sleep”\) → entity\-associative\.
- Set vs\. meronymy:
  - Member\-of\-a\-set/class → set\-member\. Physical part\-of → entity\-meronomy\.
- Use all applicable labels when multiple relations are licensed\.

OUTPUT CONSTRAINTS \(STRICT\)

- Output only the label\(s\)\.
- If multiple labels apply, join them with semicolons and NO SPACES\. Example: comparison\-sense;comparison\-relative
- Do not add explanations or any other text\.
- Do not invent labels\.
- Allowed labels \(exact spellings\):
  - comparison\-relative
  - comparison\-sense
  - comparison\-time
  - entity\-associative
  - entity\-meronomy
  - entity\-property
  - entity\-resultative
  - set\-member
  - set\-subset
  - set\-span\-interval
  - other

WORKED EXAMPLE PATTERNS \(FOR CONSISTENCY\)

- Antecedent: prior event/situation; Anaphor: “the last time” → comparison\-sense;comparison\-relative
- Antecedent: “A contemporary American”; Anaphor: “work” → entity\-associative
- Antecedent: “you”; Anaphor: “sleep” → entity\-associative

TASK EXAMPLES

\{dataset\_specific\_examples\}

TASK

In the following text, a bridging anaphor is marked with double curly brackets \{\{ \}\}, and the corresponding antecedent is surrounded by asterisks: \* \* \.

Classify the subtype\(s\) of bridging relation that hold between the two entities\.

Text:

\{text\}

Answer:
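
Because the subtype prompt fixes a closed label set and a semicolon-joined output format, a reply can be parsed and validated with a few lines of code. The Python sketch below is a hedged illustration rather than the authors' post-processing: the function name `parse_subtypes` is hypothetical, and tolerating stray spaces, ignoring case, and dropping duplicate labels are assumptions of this sketch. The label spellings are copied from the allowed-label list in the prompt.

```python
from typing import List

# Allowed labels, copied from the prompt's "exact spellings" list.
ALLOWED_LABELS = frozenset({
    "comparison-relative",
    "comparison-sense",
    "comparison-time",
    "entity-associative",
    "entity-meronomy",
    "entity-property",
    "entity-resultative",
    "set-member",
    "set-subset",
    "set-span-interval",
    "other",
})


def parse_subtypes(raw: str) -> List[str]:
    """Split a semicolon-joined reply into validated, de-duplicated labels.

    Raises ValueError for an empty reply or any label outside ALLOWED_LABELS,
    so that invented labels are caught rather than silently kept.
    """
    labels: List[str] = []
    for part in raw.strip().split(";"):
        label = part.strip().lower()  # tolerate stray spaces and casing
        if not label:
            continue
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unknown subtype label: {label!r}")
        if label not in labels:  # keep first occurrence, preserve order
            labels.append(label)
    if not labels:
        raise ValueError("reply contains no subtype label")
    return labels


if __name__ == "__main__":
    print(parse_subtypes("comparison-sense;comparison-relative"))
```

Raising on unknown labels makes prompt drift visible early; a more lenient variant could map unknown labels to `other` instead, at the cost of hiding such errors.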