# SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Source: https://arxiv.org/html/2604.16262
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe
Department of Computer Science, Swansea University, Wales, UK
{t.g.d.sumanathilaka, nicholas.micallef, julian.hough, s.j.j.g.galgodagedon}@swansea.ac.uk
###### Abstract
Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate word senses, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
## 1 Introduction
With the introduction of Transformer models Vaswani et al. (2017), language models became increasingly capable across a variety of Natural Language Processing (NLP) tasks. Lexical ambiguity has remained a major challenge for traditional models, since two or more completely unrelated senses can share the same spelling or pronunciation Bevilacqua et al. (2021). Even though recent studies reveal that large language models (LLMs) perform strongly at Word Sense Disambiguation (WSD) for common words and senses, their performance deteriorates on rare cases or less frequent words Sumanathilaka et al. (2024b); Meconi et al. (2025).
Existing benchmarks predominantly operate at the sentence level, where homonyms are disambiguated mostly from neighboring-word clues, global context analysis, and syntactic cues and dependency relations Raganato et al. (2017); Ballout et al. (2024); Blevins et al. (2021). While these datasets are effective in constrained settings, this formulation has inherent limitations: isolated sentences often provide insufficient contextual evidence and fail to reflect the richer, multi-sentence contexts required for realistic language understanding.
To address this gap, SemEval-2026 Task 5 introduces the AmbiStory dataset Gehring and Roth (2025), which comprises narrative texts that naturally encode ambiguity through their discourse structure. Each instance consists of a short narrative with four to five sentences providing situational context (the precontext), followed by an ambiguous target sentence and, optionally, a concluding sentence. By modeling ambiguity within narrative settings, this benchmark enables a more realistic evaluation of the applicability of WSD algorithms in real-world scenarios. The source code for our implementation is publicly available at https://github.com/Sumanathilaka/SwanNLP-at-SemEval-2026-Task-5.
### 1.1 Task Overview
SemEval-2026 Task 5 Gehring et al. (2026), "Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding", is designed to evaluate the ability of computational models to simulate human reasoning when interpreting the sense of a homonym in a narrative context. Participants are required to predict a human-perceived plausibility score (1-5) for each story. Each instance consists of a precontext that grounds the narrative, an ambiguous sentence containing a homonym, and, optionally, an ending that often implies a particular word sense.
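For illustration, a single task instance can be thought of as the following structure; the field names and the story text are invented here for exposition and are not drawn from AmbiStory.

```python
# Hypothetical AmbiStory-style instance (invented for illustration only).
example_instance = {
    "precontext": (
        "Maya had saved for months. She finally walked into the small shop "
        "and pointed at the glass case. The clerk smiled and lifted out a tray."
    ),
    "sentence": "She asked how much the ring would cost.",
    "ending": "She left the shop already planning the proposal.",  # optional
    "homonym": "ring",
    "candidate_sense": "a circular band of metal worn on the finger",
    "annotator_scores": [5, 5, 4, 5, 5],  # five ratings on a 1-5 plausibility scale
}
```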
The underlying dataset, AmbiStory, was annotated by human participants recruited via Prolific, with five independent annotations per instance on a five-point plausibility scale. Participants are provided with training, development, and test splits. The primary evaluation metrics are the Spearman correlation between predicted scores and the average human judgment, and Accuracy within Standard Deviation, defined as the proportion of model predictions falling within one standard deviation of the annotators' mean score. Dataset statistics are summarized in Table 1.
Table 1: Dataset Statistics. Unique senses denotes the number of unique homonyms captured in each split.
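Both metrics can be computed directly from the predicted scores and the per-instance annotator statistics. The sketch below is our own illustration (the function name is ours, not the official scorer), assuming the annotator mean and standard deviation are available for every instance.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(pred, human_mean, human_std):
    """Spearman correlation and Accuracy within Standard Deviation."""
    pred, human_mean, human_std = map(np.asarray, (pred, human_mean, human_std))
    # Rank correlation between model scores and the average human judgment.
    rho, _ = spearmanr(pred, human_mean)
    # A prediction counts as correct when it lies within one standard
    # deviation of the annotators' mean score for that instance.
    acc_within_std = float(np.mean(np.abs(pred - human_mean) <= human_std))
    return rho, acc_within_std
```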
## 2 Related Work
Recent studies show that LLMs are substantially more effective at disambiguating commonly used homonyms, owing to their ability to model rich contextual and semantic cues Cahyawijaya et al. (2024); Meconi et al. (2025). Although low-parameter models such as Qwen and Gemma lack the deep contextual representations of large-scale LLMs, fine-tuning has been shown to yield significant performance gains over their base versions Basile et al. (2025), suggesting a more energy-efficient alternative for domain- or task-specific disambiguation. These findings motivate our use of a supervised fine-tuning framework to simulate human plausibility judgments. In particular, we model both single-annotator and aggregated multi-annotator perspectives through a reasoning-driven pipeline that incorporates difficulty-aware analysis.
Prior work has also explored reformulating WSD as a higher-level reasoning task. Sainz et al. (2023) cast WSD as a textual entailment problem, prompting models to assess the compatibility between candidate sense descriptions and sentences containing ambiguous words. This zero-shot formulation outperforms random baselines and, in some cases, rivals supervised WSD systems. Complementarily, Sumanathilaka et al. (2025b) investigate how prompt engineering and in-context learning strategies with GPT-3.5-Turbo and GPT-4-Turbo can substantially improve disambiguation, while subsequent benchmarking identifies GPT-based models and DeepSeek as particularly effective for WSD Sumanathilaka et al. (2024a). Collectively, these studies indicate that LLMs are well-suited to sense disambiguation in context and that reasoning-oriented prompting can further enhance performance. Building on this line of work, our approach tackles the challenging task of plausibility score prediction by leveraging dynamic few-shot extraction for in-context learning with larger models.
## 3 Methodology
In this study, we evaluate three sets of approaches for predicting plausibility scores: supervised fine-tuning (SFT) of low-parameter LLMs, in-context learning via dynamic Retrieval Augmented Generation (RAG) with Chain-of-Thought (CoT) reasoning Wei et al. (2022), and model ensembling to better simulate multi-annotator agreement. To support the different reasoning processes used across all approaches, we classified the data into plausibility levels for use in each phase.
To implement the fine-tuning logic, we used the average of the scores assigned by the human annotators during the annotation process to derive a set of possible plausibility outcomes, listed below. This inferred rationale was then incorporated consistently into the fine-tuning procedure to simulate human judgment.
- **Average ≥ 4.0**: Meaning strongly fits the context and the ending, indicating high plausibility.
- **3.0 ≤ Average < 4.0**: Meaning reasonably fits the context and the ending, indicating moderate plausibility.
- **2.0 ≤ Average < 3.0**: Meaning shows a weak connection to the context and the ending, indicating slight plausibility.
- **Average < 2.0**: Meaning does not fit the context or the ending and is therefore not plausible.
These plausibility bands were defined through a data-driven discretization of the averaged human ratings, informed by the score distribution and the annotation characteristics of the dataset Snow et al. (2008).
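As a concrete illustration, the band assignment reduces to a simple threshold function over the averaged rating (a minimal sketch; the function and label names are ours):

```python
def plausibility_band(avg_score: float) -> str:
    """Map an averaged human rating (1-5) to the plausibility band
    used to build the fine-tuning rationale."""
    if avg_score >= 4.0:
        return "high"       # meaning strongly fits the context and ending
    if avg_score >= 3.0:
        return "moderate"   # meaning reasonably fits the context and ending
    if avg_score >= 2.0:
        return "slight"     # weak connection to the context and ending
    return "not plausible"  # meaning does not fit the context or ending
```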
Figure 1: The flow used to classify whether a case is easy or difficult for humans.
### 3.1 Fine-tuning Low-Parameter Models
For the fine-tuning task, we employ four techniques to model the reasoning process. The fine-tuning procedure is incremental, incorporating multiple reasoning strategies proposed in prior work to improve WSD performance Sumanathilaka et al. (2026). We adapt these approaches to better align with the task of simulating human annotation of plausibility scores. Each technique is specifically designed to reflect how humans evaluate and assign plausibility. The design strategies used in this study are outlined as follows.
- **Single-annotator scoring**: Given a short story, a target ambiguous word, a candidate sense, and a narrative ending, the model assigns a single plausibility score corresponding to the judgment of one annotator.
- **Multi-annotator scoring**: Given a short story, a target ambiguous word, a candidate sense, and a narrative ending, the model generates five plausibility scores to simulate the judgments of five different annotators and capture variation in human reasoning.
- **Reasoning-based scoring**: Given a story context, a target ambiguous word, a candidate sense, and a narrative ending, the model first determines whether the candidate sense is plausible by analyzing the surrounding contextual cues and the narrative ending. It then assigns the most appropriate plausibility score based on this reasoning.
- **Difficulty-aware scoring**: Given a story context, a target ambiguous word, a candidate sense, and a narrative ending, the model first identifies the difficulty level of the instance, such as human-easy or human-ambiguous, by analyzing the characteristics of the story. It then reasons about the candidate sense and assigns a plausibility score accordingly.
The models were fine-tuned to simulate both single-annotator and multi-annotator scoring processes in line with the design goals described above. The fine-tuning inputs, including the reasoning rationale and the expected output format for each strategy, are presented in Table 2. In the third fine-tuning strategy, we used GLOSSGPT Sumanathilaka et al. (2025b), a state-of-the-art word sense disambiguation system, to provide a likely sense interpretation for each homonymous target word based on the story context. This sense prediction was precomputed and used as an auxiliary reasoning cue during fine-tuning, rather than treated as a definitive ground-truth label.
More specifically, GLOSSGPT was used to identify the most probable sense supported by the contextual clues in the story, which helped structure the intermediate reasoning process for plausibility scoring. However, because plausibility judgments may vary across annotators, especially in human-ambiguous cases, GLOSSGPT's role was limited to providing a stable semantic reference point. The fine-tuning rationale was designed to reduce over-reliance on a single predicted sense and instead emphasize how low-parameter LLMs can model human-like variation in plausibility assessment, particularly when the context supports more than one reasonable interpretation.
Table 2: Mapping of fine-tuning strategies to their inputs for training and expected outputs during inference.
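To make the inputs concrete, the following is a hypothetical chat-style example for the reasoning-based strategy; the field names and wording are illustrative rather than the exact template used for fine-tuning, and the precomputed sense is the auxiliary GLOSSGPT cue described above, not a gold label.

```python
def build_reasoning_example(homonym, precontext, sentence, ending,
                            candidate_sense, precomputed_sense):
    """Assemble an illustrative chat-style input for reasoning-based scoring."""
    user = (
        f"Story context: {precontext}\n"
        f"Ambiguous sentence: {sentence}\n"
        f"Ending: {ending}\n"
        f"Target word: {homonym}\n"
        f"Candidate sense: {candidate_sense}\n"
        f"Likely sense from context (auxiliary cue): {precomputed_sense}\n"
        "First reason about whether the candidate sense is plausible given the "
        "context and the ending, then output a plausibility score from 1 to 5."
    )
    return [{"role": "user", "content": user}]
```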
#### 3.1.1 Study Setup
For model development, supervised fine-tuning (SFT) was employed. The baseline models were fine-tuned using Low-Rank Adaptation (LoRA) to enable efficient training while reducing computational overhead. Training data were pre-processed into a chat-style prompt–response format, with each query containing the homonym, precontext, target sentence, and ending. For the last two fine-tuning phases, the related pre-computed senses and difficulty tags were incorporated to build the reasoning logic.
Fine-tuning was conducted using the Hugging Face transformers and trl libraries, with a custom prompt-formatting function to standardize inputs. The data were tokenized using the native tokenizer of each model. We used a batch size of 4, gradient accumulation over 8 steps, and a learning rate of 2×10⁻⁴; these settings were chosen based on preliminary experimentation. Optimization was performed using AdamW with a linear learning-rate scheduler. All experiments were conducted on an NVIDIA A100-PCIE-40GB GPU, which provided sufficient memory to fine-tune the models efficiently without quantization in the final configuration.
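A minimal sketch of this setup with the transformers, peft, and trl libraries is given below; the model name, LoRA rank, and dataset construction are placeholders, and argument names may differ slightly across library versions.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # placeholder low-parameter model
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset: each "text" entry holds one fully formatted
# prompt-response string produced by the prompt-formatting step.
train_dataset = Dataset.from_list([{"text": "..."}])

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules="all-linear", task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="swannlp-sft",
    per_device_train_batch_size=4,   # batch size 4
    gradient_accumulation_steps=8,   # accumulate over 8 steps
    learning_rate=2e-4,
    lr_scheduler_type="linear",      # linear learning-rate schedule
    optim="adamw_torch",             # AdamW optimizer
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # trainer consumes the "text" column
    peft_config=peft_config,
)
trainer.train()
```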
### 3.2 Dynamic Few-shot Learning for Commercial Models
During the first phase of the study, we observed that small models, including fine-tuned variants, are not well suited to handling ambiguous sentences that humans judge as moderately plausible. To address this limitation, we propose a RAG-inspired approach Lewis et al. (2020) that enriches in-context knowledge during inference with large-parameter models. Following the logic illustrated in Figure 1, we subdivided the training data into three categories.
**Ambiguous Context (1088 Records)**: This category contains instances for which human annotators assigned diverse plausibility scores. Using the annotations' mean and standard deviation, we identified cases with substantial disagreement among annotators, indicating high ambiguity.
**Human Easy - High Score (631 Records)**: This category consists of instances that humans can annotate with high confidence. When all annotators are in agreement and consistently assign high scores, it indicates that the given meaning is highly suitable for the ambiguous sentence.
**Human Easy - Low Score (561 Records)**: This category also includes instances that are easy for humans to annotate, but where annotators consistently assign low scores. Such an agreement indicates that the given meaning is not suitable for the homonym in the provided context.
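A minimal sketch of this split is shown below; the disagreement and score thresholds are our own illustrative choices, since the exact cut-offs used to build the three categories are not specified here.

```python
import statistics

def categorise(ratings, std_threshold=1.0, high_cut=3.5):
    """Assign an instance to one of the three retrieval categories based on
    its five human ratings. Thresholds are illustrative, not the paper's."""
    mean = statistics.mean(ratings)
    std = statistics.pstdev(ratings)
    if std >= std_threshold:          # substantial annotator disagreement
        return "ambiguous_context"
    return "human_easy_high" if mean >= high_cut else "human_easy_low"

# categorise([5, 5, 4, 5, 5]) -> "human_easy_high"
# categorise([1, 2, 1, 1, 2]) -> "human_easy_low"
# categorise([1, 3, 5, 2, 4]) -> "ambiguous_context"
```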
This categorization formed the basis for constructing the vector stores used as retrievers in our few-shot inference setting. Each story was treated as a separate chunk in the vector store, and embeddings were generated using BAAI/bge-small-en-v1.5. A FAISS vector index was used to store the story embeddings. The retriever employs similarity_search to fetch the most relevant few-shot examples during the inference process.
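A minimal sketch of this retrieval setup, assuming the LangChain wrappers for FAISS and Hugging Face embeddings (the story lists and the choice of k are placeholders):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Placeholder category lists; in practice these hold the categorised training stories.
ambiguous_stories = ["..."]
easy_high_stories = ["..."]
easy_low_stories = ["..."]
query_story = "..."  # the test story being scored

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# One FAISS index per category; each story is a separate chunk.
stores = {
    "ambiguous_context": FAISS.from_texts(ambiguous_stories, embeddings),
    "human_easy_high": FAISS.from_texts(easy_high_stories, embeddings),
    "human_easy_low": FAISS.from_texts(easy_low_stories, embeddings),
}

# At inference time, retrieve the most similar story from each category
# to serve as few-shot examples in the prompt.
few_shot_examples = [
    doc.page_content
    for store in stores.values()
    for doc in store.similarity_search(query_story, k=1)
]
```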
Initial experiments were conducted with K = 1, 2, and 3, and we observed that