Source: [https://arxiv.org/html/2604.19921](https://arxiv.org/html/2604.19921)
## Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
Zijie Wang¹, MohammadHossein Rezaei¹, Farzana Rashid², Eduardo Blanco¹
¹University of Arizona ²University of North Carolina Asheville
\{zijiewang, mhrezaei, eduardoblanco\}@arizona.edu, frashid@unca.edu
###### Abstract
Negation is a common and important semantic feature in natural language, yet large language models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with *if-then* relations. In addition, pre-training LLMs on our corpora benefits negation understanding.
## 1 Introduction
Negation is a common and important semantic feature in natural language, appearing in approximately 25% of English sentences Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)). Despite the recent success of large language models (LLMs) across various natural language processing tasks Achiam et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib10)); Touvron et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib6)), their understanding of negation remains unclear. Previous work has demonstrated that language models struggle with multiple natural language understanding tasks when negation is involved Dobreva and Keller ([2021](https://arxiv.org/html/2604.19921#bib.bib36)); Jang et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib38)). However, these investigations have been limited to encoder-based models such as BERT Devlin et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib37)) and earlier LLMs such as GPT-3 Brown et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib31)).
Commonsense knowledge has been extensively studied, with numerous efforts focused on building commonsense knowledge bases Speer et al. ([2017](https://arxiv.org/html/2604.19921#bib.bib5)); Sap et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib4)). Commonsense reasoning has also been investigated through tasks such as question answering Talmor et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib18)). Despite these extensive efforts, the intersection of commonsense knowledge and negation remains underexplored.
Figure 1: A commonsense knowledge triple with the *Intention* relation. Negations are added to both *if* and *then* events. Adding different negation cues results in new triples that align with (same color on both sides) or conflict with (different colors) commonsense knowledge.

Atomic Sap et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib4)) is one of the largest commonsense knowledge corpora with *if-then* relations, representing commonsense knowledge as a triple <A, R, B>, where A represents the *if* event, R the relation, and B the *then* event. For example, <*the person is hungry*, *the person wants to*, *eat food*> represents the commonsense knowledge that "*if the person is hungry, then the person wants to eat food*." To our knowledge, Anion Jiang et al. ([2021](https://arxiv.org/html/2604.19921#bib.bib14)) is the only work that develops a commonsense knowledge base with negation, built on top of Atomic. However, their approach only negates the *if* event and generates a new *then* event via human annotation, resulting in a new triple <¬A, R, B′>. This is limited because it does not consider negating the *then* event and requires significant human annotation effort.
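The <A, R, B> representation can be sketched as a small data structure. This is a minimal illustration only; the `Triple` class and `verbalize` method are our own names, not the paper's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A commonsense knowledge triple <A, R, B>: if-event, relation, then-event."""
    if_event: str    # A
    relation: str    # R
    then_event: str  # B

    def verbalize(self) -> str:
        # Render the triple as a natural-language if-then statement.
        return f"If {self.if_event}, then {self.relation} {self.then_event}."

# The running example from the paper.
t = Triple("the person is hungry", "the person wants to", "eat food")
```

Here `t.verbalize()` yields the sentence quoted above: "If the person is hungry, then the person wants to eat food."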
In this work, we present a novel approach to automatically augment existing commonsense knowledge corpora with negation. Our method is motivated by the observation that negating either the *if* event, the *then* event, or both can sometimes produce new triples that still align with commonsense knowledge. As shown in the example in Figure [1](https://arxiv.org/html/2604.19921#S1.F1), negating the *if* event *"Euler unsuccessfully applied for a position in Physics"* implies that *the intention of Euler* was to *"teach in Physics."* In contrast, negating the *then* event *"not teach in Physics"* yields a triple that conflicts with commonsense knowledge (green *if* event and red *then* event). This observation motivates us to augment both *if* and *then* events with negation, expanding existing commonsense knowledge corpora by up to three times (negating the *if* event, the *then* event, or both). In addition, our approach avoids relying on human effort to generate new events such as *"not join Math department"*. We further propose an automatic validation method to verify whether the new commonsense triples with negation align with commonsense knowledge. More importantly, we demonstrate that they benefit LLMs' understanding of negation in general-purpose tasks. The main contributions of this paper are (code and dataset available at [https://github.com/wang-zijie/commonsense_with_negation](https://github.com/wang-zijie/commonsense_with_negation)):
- A novel approach to develop two commonsense knowledge corpora with over 2M triples containing negation.
- An automatic method to validate commonsense triples with negation.
- Evaluation of multiple models across five benchmarks, demonstrating that our corpora improve LLMs' understanding of negation.
## 2 Related Work
#### Commonsense knowledge bases
(CSKB) have been developed across multiple works with different focuses. ConceptNet Speer et al. ([2017](https://arxiv.org/html/2604.19921#bib.bib5)) represents taxonomic commonsense knowledge as a graph, where each concept (i.e., word or phrase) is connected by a relation. For example, *a net* is used for *catching fish*. Sap et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib4)) focus on relations between events rather than concepts, proposing nine *if-then* relations representing inferential knowledge. For example, if *X pays Y a compliment*, then Y wants to *return the compliment*. To our knowledge, Anion Jiang et al. ([2021](https://arxiv.org/html/2604.19921#bib.bib14)) is the only work that investigates commonsense knowledge with negation. It is built by negating the *if* event from Atomic and manually annotating a new *then* event. For example, if *X does not pay Y a compliment*, then the effect on Y is *upset*. This approach precludes negating the *then* event and requires significant human annotation effort. Arnaout et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib2)) develop a method to identify informative negations of commonsense concepts. Their dataset complements existing CSKB, which often do not capture negative concepts such as "gorillas are not territorial." The UNcommonsense dataset Zhao et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib22)) collects data involving unusual, unexpected, and unlikely situations, namely uncommonsense knowledge. The primary task asks LLMs to generate reasonable explanations given contexts with uncommon outcomes.
Beyond manual efforts to create CSKB, Bosselut et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib21)) propose COMET, a system that trains a transformer model to automatically generate the *then* event given the *if* event and relation. West et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib20)) distill commonsense knowledge from LLMs to train a transformer for commonsense graph generation. Unlike existing works, we propose a novel approach to incorporate negation into existing CSKB. Our approach generates large-scale commonsense knowledge triples with negation while requiring no human effort.
#### Commonsense Knowledge for Downstream Tasks
Commonsense knowledge has been shown to benefit many downstream tasks. Talmor et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib18)) present a question answering dataset called CommonsenseQA that involves reasoning over commonsense knowledge. They develop multiple-choice questions based on the concept-relation graph from ConceptNet. Lal et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib29)) observe that answering why-questions often requires commonsense reasoning and leverage COMET to generate relevant commonsense knowledge for this task. Guan et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib19)) pre-train a transformer model on external commonsense knowledge bases to improve story generation. In this work, we leverage commonsense knowledge corpora augmented with negation to improve LLMs' understanding of negation, with experimental results across three tasks demonstrating their effectiveness.
#### Improving Models’ Negation Understanding
Prior work has shown that LLMs struggle with natural language understanding tasks involving negation Dobreva and Keller ([2021](https://arxiv.org/html/2604.19921#bib.bib36)); Jang et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib38)), motivating efforts to address this limitation. Hosseini et al. ([2021](https://arxiv.org/html/2604.19921#bib.bib27)) leverage unlikelihood training and synthetic data generation to improve BERT's understanding of negation. Singh et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib9)) modify BERT's next sentence prediction task by incorporating negation cues rather than random sentences. Rezaei and Blanco ([2024](https://arxiv.org/html/2604.19921#bib.bib8)) show that incorporating affirmative interpretations improves performance on negation understanding benchmarks. In this work, we demonstrate that state-of-the-art LLMs still lack robust negation understanding and develop two commonsense knowledge corpora augmented with negation. More importantly, pre-training on our corpora improves performance on multiple tasks requiring negation understanding.
| Model | Valid (P / R / F1) | Invalid (P / R / F1) | Ambiguous (P / R / F1) | Overall (P / R / F1) | Acc |
|---|---|---|---|---|---|
| **Few-shot learning** | | | | | |
| Llama 3.1 8B | 0.56 / 0.58 / 0.57 | 0.40 / 0.71 / 0.52 | 0.54 / 0.07 / 0.13 | 0.45 / 0.45 / 0.37 | 0.44 |
| Llama 3.1 70B | 0.73 / 0.54 / 0.62 | 0.50 / 0.73 / 0.59 | 0.48 / 0.38 / 0.43 | 0.56 / 0.55 / 0.53 | 0.53 |
| GPT-4o | 0.71 / 0.48 / 0.51 | 0.54 / 0.65 / 0.59 | 0.51 / 0.54 / 0.51 | 0.53 / 0.54 / 0.52 | 0.54 |
| Claude Sonnet 4 | 0.83 / 0.38 / 0.53 | 0.51 / 0.63 / 0.56 | 0.46 / 0.64 / 0.54 | 0.61 / 0.56 / 0.56 | 0.56 |
| **Fine-tuning** | | | | | |
| Llama 3.1 8B | 0.57 / 0.80 / 0.67 | 0.67 / 0.48 / 0.56 | 0.53 / 0.45 / 0.48 | 0.58 / 0.57 / 0.55 | 0.56 |
| Llama 3.1 70B | 0.70 / 0.76 / 0.73 | 0.79 / 0.48 / 0.59 | 0.51 / 0.68 / 0.58 | 0.65 / 0.63 / 0.63 | 0.64 |

Table 1: Results of validating commonsense triples with negation via (1) few-shot learning with Llama 3.1 8B, 70B, GPT-4o, and Claude Sonnet 4, and (2) fine-tuning Llama 3.1 8B and 70B. We report precision (P), recall (R), and F1 for each label and overall, along with overall accuracy (Acc). Only the best fine-tuning results are reported across variants of whether *Valid* and *Ambiguous* training instances come from Atomic, Anion, or both; *Invalid* training data is synthesized using GPT-4o. Full results are reported in Appendix [B.1](https://arxiv.org/html/2604.19921#A2.SS1).
## 3 Augmenting Commonsense Knowledge Triples with Negation
We propose a novel approach to automatically augment existing commonsense knowledge corpora with negation. Building on Atomic Sap et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib4)) and Anion Jiang et al. ([2021](https://arxiv.org/html/2604.19921#bib.bib14)), we develop two new commonsense knowledge corpora: ¬Atomic and ¬Anion. We further introduce an automatic validation method that categorizes the generated commonsense triples as *Valid* (aligned with commonsense knowledge), *Invalid* (conflicting with commonsense knowledge), or *Ambiguous* (neither). Our resulting corpora contain over 2M commonsense triples with negation.
#### Negating Commonsense Knowledge Triples
As discussed in Section [2](https://arxiv.org/html/2604.19921#S2), Atomic consists of commonsense knowledge triples with *if-then* relations connecting two events. Anion extends Atomic by negating the *if* event and manually annotating a new *then* event; however, it does not consider negating the *then* event. Unlike these approaches, we incorporate negation into the *if* event, the *then* event, and both. Specifically, given a commonsense triple <A, R, B> from Atomic, where A is the *if* event, R is the relation, and B is the *then* event, we leverage an LLM to add the logical negation cue *not* to A, to B, and to both, generating three new triples: <¬A, R, B>, <A, R, ¬B>, and <¬A, R, ¬B>. For a commonsense triple <¬A, R, B′> from Anion, where B′ is a new *then* event distinct from B, we negate only B′ to generate <¬A, R, ¬B′>. We do not negate ¬A, as doing so would yield the same *if* event as in Atomic. Note that we use only the training split to generate these triples.
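A minimal sketch of this augmentation step; the `toy_negate` stand-in below replaces the paper's LLM-based negation, and all function names are ours.

```python
def augment_with_negation(triple, negate):
    """Given <A, R, B> and a negation function, return the three new triples
    <not A, R, B>, <A, R, not B>, and <not A, R, not B>."""
    a, r, b = triple
    return [
        (negate(a), r, b),           # negate the if event
        (a, r, negate(b)),           # negate the then event
        (negate(a), r, negate(b)),   # negate both
    ]

# Toy stand-in for the LLM negation step (the paper prompts Llama 3.1 70B
# with curated exemplars to place the cue grammatically).
toy_negate = lambda event: "not " + event
variants = augment_with_negation(
    ("the person is hungry", "wants to", "eat food"), toy_negate
)
```

Each source triple thus contributes up to three negated variants, which is how the corpora grow by up to three times.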
To perform negation, we add the negation cue *not* to (1) the main verb of the event (e.g., "the person *does not* take a picture", "*not* look at the picture"), or (2) the modifier when the verb is absent (e.g., "*not* excited"). Using manually curated exemplars, we prompt Llama 3.1 70B to generate the negated events. A manual evaluation of 200 instances confirms 99% syntactic correctness.
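The prompting setup might look roughly like the following sketch; the exemplars and prompt wording here are illustrative assumptions, not the paper's actual prompt.

```python
# Illustrative exemplars; the paper's exemplars are manually curated.
EXEMPLARS = [
    ("the person takes a picture", "the person does not take a picture"),
    ("excited", "not excited"),
]

def build_negation_prompt(event: str) -> str:
    """Assemble a few-shot prompt asking the model to add the cue 'not'
    to the main verb (or to the modifier when no verb is present)."""
    lines = ["Add the negation cue 'not' to the event."]
    for src, tgt in EXEMPLARS:
        lines.append(f"Event: {src}\nNegated: {tgt}")
    lines.append(f"Event: {event}\nNegated:")  # model completes this line
    return "\n\n".join(lines)
```

The assembled prompt would then be sent to Llama 3.1 70B, which completes the final `Negated:` line.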
### 3.1 Validating New Triples
It remains unknown whether the automatically generated triples with negation align or conflict with commonsense knowledge. We use *Valid* and *Invalid* to denote these cases, respectively. In addition, we observe two scenarios in which triples are considered *Ambiguous*: (1) the validity is context-dependent; for example, "If the sun is not shining, then it is daytime" can be either *Valid* or *Invalid* because "the sun is not shining" could indicate heavy clouds during daytime or simply nighttime; (2) the triple lacks a clear causal connection and thus has ambiguous semantics; for example, "If Person A does not get a gift, then it causes Person A to feel joy."
| Dataset | # Triples (%) | ⟨A, R, B⟩ (w/o neg.) | ⟨¬A, R, B⟩ | ⟨A, R, ¬B⟩ | ⟨¬A, R, ¬B⟩ |
|---|---|---|---|---|---|
| **Existing corpora** | | | | | |
| Atomic∗ | 449k | 449k | — | — | — |
| Anion∗ | 142k | — | 142k | — | — |
| **Our corpora** | | | | | |
| ¬Atomic | 1,798k (100.0) | 449k (100.0) | 449k (100.0) | 449k (100.0) | 449k (100.0) |
| *Valid* | 681k (37.9) | 376k (83.7) | 47k (10.5) | 42k (9.2) | 216k (48.0) |
| *Invalid* | 463k (25.8) | 8k (2.0) | 128k (28.5) | 286k (63.6) | 41k (9.1) |
| *Ambiguous* | 652k (36.3) | 64k (14.3) | 274k (61.0) | 122k (27.2) | 192k (42.9) |
| ¬Anion | 285k (100.0) | — | 142k (100.0) | — | 142k (100.0) |
| *Valid* | 104k (36.4) | — | 77k (53.9) | — | 27k (18.9) |
| *Invalid* | 46k (16.1) | — | 8k (5.6) | — | 38k (26.6) |
| *Ambiguous* | 135k (47.5) | — | 58k (40.5) | — | 77k (54.5) |
| **Benchmark (¬Atomic)** | 7,200 (100.0) | 1,800 (100.0) | 1,800 (100.0) | 1,800 (100.0) | 1,800 (100.0) |
| *Valid* | 2,329 (32.3) | 1,287 (71.5) | 113 (6.3) | 118 (6.5) | 811 (45.1) |
| *Invalid* | 2,150 (29.9) | 41 (2.3) | 823 (45.7) | 1,184 (65.8) | 102 (5.6) |
| *Ambiguous* | 2,721 (37.8) | 472 (26.2) | 864 (48.0) | 498 (27.7) | 887 (49.3) |

Table 2: Statistics of commonsense triples in existing corpora, our corpora, and the benchmark. Atomic contains no triples with negation, and Anion negates only the *if* event. ∗ denotes subsets of the datasets: training splits from each corpus, excluding underspecified triples from Atomic and using only the logical negation split from Anion.

Recent work leverages LLM-as-a-judge to evaluate LLM outputs such as synthetic datasets Li et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib23)). However, we demonstrate that state-of-the-art LLMs, including GPT-4o and Claude Sonnet 4, lack the ability to evaluate the validity of commonsense knowledge triples with negation. We address this limitation by training a task-specific LLM to automatically validate commonsense triples.
We train LLMs using supervised fine-tuning with three types of data: (1) *Valid* triples from existing commonsense corpora (Atomic and Anion); (2) *Ambiguous* triples created by randomly combining *if* events, *then* events, and relations from existing triples, similar to Fang et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib1)); and (3) *Invalid* triples constructed by prompting GPT-4o to generate *then* events from existing *if* events. Note that the training data are noisy, as they undergo no manual inspection. For evaluation, we construct a benchmark comprising 200 triples per relation type from Atomic along with their three negated variations (7,200 triples total). We recruit two annotators to label each triple as *Valid*, *Invalid*, or *Ambiguous*, achieving an inter-annotator agreement of 0.62, indicating substantial agreement Artstein and Poesio ([2008](https://arxiv.org/html/2604.19921#bib.bib34)). Note that the training and evaluation data do not overlap, as they are sampled from the original training and test splits, respectively. Appendix [A](https://arxiv.org/html/2604.19921#A1) provides more details on benchmark creation and annotation.
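The *Ambiguous* construction by random recombination can be sketched as follows; the function name and the deterministic-seed choice are ours.

```python
import random

def make_ambiguous(triples, n, seed=0):
    """Synthesize noisy 'Ambiguous' triples by randomly recombining if events,
    relations, and then events drawn from existing triples, skipping any
    combination that reproduces a real (Valid) triple."""
    rng = random.Random(seed)
    ifs, rels, thens = zip(*triples)
    out = set()
    while len(out) < n:
        cand = (rng.choice(ifs), rng.choice(rels), rng.choice(thens))
        if cand not in triples:  # avoid recreating an existing triple
            out.add(cand)
    return sorted(out)
```

Because the pieces come from unrelated triples, the recombined statements typically lack a clear causal connection, matching the second *Ambiguous* scenario described above.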
We experiment with Llama 3.1 8B and 70B using three training data variations, differing only in whether *Valid* and *Ambiguous* triples come from Atomic, Anion, or both; *Invalid* triples are always synthesized via an LLM. Regardless of the training corpus source, the models are trained on 5,400 training triples (200 triples per relation per label) using QLoRA Dettmers et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib15)) with 4-bit quantization. Both training and evaluation data are verbalized from commonsense triples into natural-language if-then statements. Further experimental details, including the verbalization mapping, can be found in Appendix [B](https://arxiv.org/html/2604.19921#A2).
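The verbalization step might look like the sketch below. The relation names and templates here are illustrative assumptions; the paper's actual mapping is given in its Appendix B.

```python
# Hypothetical relation-to-template mapping (for illustration only).
TEMPLATES = {
    "xIntent": "If {a}, then the intention of PersonX is {b}.",
    "xWant": "If {a}, then PersonX wants {b}.",
}

def verbalize(a: str, r: str, b: str) -> str:
    """Turn a triple <A, R, B> into the natural-language if-then
    statement used for training and evaluation."""
    return TEMPLATES[r].format(a=a, b=b)
```

For instance, `verbalize("PersonX is hungry", "xWant", "to eat food")` produces a single if-then sentence the model can score or classify.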
Table [1](https://arxiv.org/html/2604.19921#S2.T1) reports the validation results on our benchmark. Claude Sonnet 4 outperforms the other few-shot LLMs, though it achieves only 0.56 overall F1. The fine-tuned Llama 3.1 70B model outperforms all proprietary models (F1: 0.63 vs. 0.56; accuracy: 0.64 vs. 0.56). Llama 3.1 8B shows lower performance, as expected for a smaller model. We consider precision for *Valid* and *Invalid* triples the more critical metric, as only triples with these two labels are used for training (Section [4](https://arxiv.org/html/2604.19921#S4)). Lower recall is tolerable since it only reduces the training set size. Our best LLM judge achieves 0.70 and 0.79 precision for *Valid* and *Invalid* triples, respectively. Moreover, empirical experiments (Section [5](https://arxiv.org/html/2604.19921#S5)) demonstrate that the commonsense corpora synthesized using our LLM judge effectively improve LLMs' negation understanding.
### 3.2 Analysis of Commonsense Knowledge Corpora with Negation
We develop two new corpora with negation, ¬Atomic and ¬Anion, whose triples are labeled *Valid*, *Invalid*, or *Ambiguous* by our LLM judge (Table [1](https://arxiv.org/html/2604.19921#S2.T1), last row). Table [2](https://arxiv.org/html/2604.19921#S3.T2) presents statistics for the existing corpora (Atomic and Anion), our corpora, and our benchmark.
We use only a subset of the existing corpora: the training split, excluding underspecified triples (e.g., "PersonX sees ___ in the water") from Atomic and using only the logical negation split from Anion. Our approach to augmenting commonsense triples with negation is effective: 37.9% of the commonsense triples from ¬Atomic (36.4% from ¬Anion) are identified as *Valid*, aligning with commonsense knowledge, while fewer triples are identified as *Invalid* (¬Atomic: 25.8%; ¬Anion: 16.1%). More importantly, they are augmented with negation. As we demonstrate in Section [5](https://arxiv.org/html/2604.19921#S5), both *Valid* and *Invalid* triples improve LLMs' ability to understand negation.
## 4 Commonsense Knowledge with Negation for Downstream Tasks
We have presented a novel method to automatically augment existing commonsense knowledge corpora with negation. Our two corpora, ¬Atomic and ¬Anion, contain over 2M commonsense knowledge triples augmented with negation, validated by an LLM judge. Beyond this contribution, we demonstrate that our corpora improve LLMs' ability to understand negation. Specifically, we evaluate on five benchmarks across three tasks: question answering (QA), natural language inference (NLI), and information retrieval (IR).
### 4.1 Benchmarks Evaluating LLMs' Negation Understanding
CondaQA Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)) is a contrastive QA dataset requiring understanding of negation cues in passages to answer questions. Each question is paired with a passage containing the answer, with answers being either *Yes*, *No*, *Don't know*, a span in the question, or a span in the context. Following Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)), we evaluate CondaQA using two metrics: *accuracy* and *group consistency*. The term *group* refers to the original passage together with either all three or one of the edited passages. *Group consistency* measures the percentage of questions answered correctly for all passages in a group, which is arguably more important than accuracy, as robustness against negation requires correctly answering questions over both original and negated passages.
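Group consistency can be computed as in the following sketch, assuming we have a per-example correctness flag and a group identifier; the names are ours.

```python
from collections import defaultdict

def group_consistency(records):
    """Percentage of groups in which EVERY question/passage pair is answered
    correctly. `records` is a list of (group_id, is_correct) pairs."""
    groups = defaultdict(list)
    for gid, ok in records:
        groups[gid].append(ok)
    # A group counts only if all its answers are correct.
    return 100.0 * sum(all(v) for v in groups.values()) / len(groups)

# Two groups: only g1 is fully correct, so consistency is 50%.
score = group_consistency([("g1", True), ("g1", True), ("g2", True), ("g2", False)])
```

This is why group consistency is stricter than plain accuracy: a single wrong answer on a negated edit zeroes out the whole group.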
NLI with Negation is introduced by Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)). It contains three NLI benchmarks developed from existing benchmarks: RTE Dagan et al. ([2005](https://arxiv.org/html/2604.19921#bib.bib33)), SNLI Bowman et al. ([2015](https://arxiv.org/html/2604.19921#bib.bib30)), and MNLI Williams et al. ([2018](https://arxiv.org/html/2604.19921#bib.bib32)). The authors show that existing NLI benchmarks contain few negation cues and develop new benchmarks by negating the main verbs in premises and hypotheses to create three new pairs from each original pair. The new pairs are manually annotated to obtain labels.
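The pair-expansion scheme can be sketched as below, assuming the three new pairs negate the premise, the hypothesis, or both; note that gold labels come from manual annotation, not from this transformation.

```python
def expand_nli_pair(premise, hypothesis, negate):
    """From one (premise, hypothesis) pair, create three new pairs by
    negating the premise, the hypothesis, or both. Labels for the new
    pairs must still be assigned by human annotators."""
    return [
        (negate(premise), hypothesis),
        (premise, negate(hypothesis)),
        (negate(premise), negate(hypothesis)),
    ]
```

The `negate` argument stands in for the main-verb negation procedure of Hossain et al. (2020).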
NevIR Weller et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib26)) addresses the weakness of neural information retrieval (IR) models in understanding negation. The dataset is constructed using contrastive query-document pairs from CondaQA, where each pair of queries and documents is nearly identical except for a crucial negation. An IR model is expected to rank documents for each query by correctly understanding negation. Pairwise accuracy serves as the evaluation metric: the model must correctly rank documents for both queries, flipping the ranking when given the negated query. Although the dataset's primary purpose is to evaluate IR model performance, it is essentially a binary classification task, as only two documents are provided for each query. Still, it is considered a challenging benchmark, as models must understand negation in long contexts (documents).
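Pairwise accuracy can be computed as in this sketch (names ours): each pair counts only when the ranking is correct for both the original and the negated query.

```python
def pairwise_accuracy(results):
    """NevIR-style pairwise accuracy. `results` is a list of
    (orig_correct, neg_correct) booleans per query pair; a pair counts
    as correct only if BOTH rankings are correct."""
    correct = sum(1 for orig_ok, neg_ok in results if orig_ok and neg_ok)
    return 100.0 * correct / len(results)

# Three pairs, only the first is correct on both queries.
score = pairwise_accuracy([(True, True), (True, False), (False, True)])
```

Like group consistency on CondaQA, this metric penalizes models that answer the affirmative query correctly but fail to flip the ranking under negation.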
### 4.2 Training LLMs with Commonsense Knowledge with Negation
Although many of our generated triples are identified as *Invalid* (Table [2](https://arxiv.org/html/2604.19921#S3.T2)), meaning they conflict with commonsense knowledge, they remain useful for teaching LLMs to understand negation. Specifically, we design a training objective that enables models to learn from both *Valid* and *Invalid* triples, and to learn how negating the *if* or *then* event affects triple validity. We construct training corpora by selecting contrastive triples from ¬Atomic following the patterns below. We select triples if:
- the original triple <A, R, B> is *Valid*, <¬A, R, B> is *Invalid*, and <A, R, ¬B> is *Valid*;
- the original triple <A, R, B> is *Valid*, <¬A, R, B> is *Valid*, and <A, R, ¬B> is *Invalid*;
- the original triple <A, R, B> is *Invalid*, <¬A, R, B> is *Valid*, and <A, R, ¬B> is *Invalid*; or
- the original triple <A, R, B> is *Invalid*, <¬A, R, B> is *Invalid*, and <A, R, ¬B> is *Valid*.
For ¬Anion, we select triples where the original and negated triples have different labels.
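The selection patterns above can be expressed as a simple filter; this is a sketch, with the label names following the paper and the function name ours.

```python
def select_contrastive(labels):
    """Keep a ¬Atomic triple family when the judge labels of
    (<A,R,B>, <not A,R,B>, <A,R,not B>) match one of the four
    contrastive patterns. `labels` is that 3-tuple of labels."""
    patterns = {
        ("Valid", "Invalid", "Valid"),
        ("Valid", "Valid", "Invalid"),
        ("Invalid", "Valid", "Invalid"),
        ("Invalid", "Invalid", "Valid"),
    }
    return labels in patterns
```

Any family containing an *Ambiguous* label, or whose labels do not form one of these contrasts, is discarded.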
| Model | # Params | Accuracy (Δ%) | All | Par. | Sco. | Aff. |
|---|---|---|---|---|---|---|
| **Fully supervised** | | | | | | |
| UnifiedQA-v2-large Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)) | 770M | 66.7 | 30.2 | 64.0 | 43.7 | 46.5 |
| RoBERTa-large | 355M | 64.9 | 29.6 | 61.9 | 41.4 | 45.8 |
| Pre-trained with Atomic+Anion | | 67.0 (+3.2) | 32.9 | 65.6 | 46.3 | 48.1 |
| Best of (¬Atomic, ¬Anion) | | 68.5 (+5.5) | 34.3 | 66.4 | 47.6 | 50.1 |
| **In-context learning (zero-shot)** | | | | | | |
| OpenAI o1 | — | 65.3 | 24.9 | 67.4 | 43.8 | 38.6 |
| **In-context learning (few-shot)** | | | | | | |
| InstructGPT + COT Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)) | — | 66.3 | 27.3 | 64.2 | 45.1 | 44.9 |
| GPT-4o | — | 72.9 | 34.4 | 78.7 | 52.6 | 47.9 |
| Llama 3.1 70B | 70B | 77.5 | 43.3 | 83.7 | 61.8 | 54.9 |
| Llama 3.1 8B | 8B | 68.7 | 31.5 | 69.0 | 48.1 | 44.4 |
| Pre-trained with Atomic+Anion | | 67.6 (-1.6) | 27.5 | 69.5 | 44.3 | 41.4 |
| Best of (¬Atomic, ¬Anion) | | 71.5 (+4.1)∗ | 33.5 | 72.9 | 48.7 | 46.4 |
| Qwen2 7B | 7B | 65.1 | 24.9 | 65.9 | 39.7 | 39.2 |
| Pre-trained with Atomic+Anion | | 64.1 (-1.5) | 22.8 | 65.2 | 38.4 | 37.7 |
| Best of (¬Atomic, ¬Anion) | | 69.7 (+7.1)∗ | 32.7 | 71.2 | 46.3 | 45.5 |

Table 3: Results evaluating CondaQA in two settings: fully supervised (top) and in-context learning (bottom). The last four columns report group consistency: over all edits (All) and over the paraphrase (Par.), scope (Sco.), and affirmative (Aff.) edit groups. Delta (Δ) indicates the percent change in accuracy compared to the off-the-shelf model. An asterisk (∗) indicates a statistically significant improvement (McNemar's test McNemar ([1947](https://arxiv.org/html/2604.19921#bib.bib35)), p < 0.05) over both the off-the-shelf model and the model trained with existing corpora.

This approach balances the distribution of *Valid* and *Invalid* triples and, more importantly, highlights two patterns for models to compare: (1) negating either the *if* or *then* event preserves the new triple's validity (either *Valid* or *Invalid*), and (2) negating either the *if* or *then* event flips the new triple's validity. We do not include <¬A, R, ¬B>, as double negation results in more complex semantics.
The final training set consists of 89k triples from ¬Atomic and 76k triples from ¬Anion. Finally, the model is trained to predict whether a triple is *Valid* or *Invalid*, using the validation results (Section [3.1](https://arxiv.org/html/2604.19921#S3.SS1)) as ground truth.
As a baseline, we also construct a training dataset without augmented negation. We sample commonsense triples from Atomic and Anion as *Valid* instances, while reusing the *Invalid* triples synthesized by an LLM (Section [3.1](https://arxiv.org/html/2604.19921#S3.SS1)). Although Anion contains triples with negated *if* events, they differ significantly from our corpora because they do not preserve the original *then* event. Note that we always sample the same number of triples from Atomic and Anion as from ¬Atomic and ¬Anion.
## 5 Experiments
We evaluate our corpora on negation understanding using an encoder-based model (RoBERTa-large) and two LLMs (Llama 3.1 8B and Qwen2 7B) across three tasks: (1) CondaQA, a QA task (Section [5.1](https://arxiv.org/html/2604.19921#S5.SS1)); (2) NLI with Negation, an NLI task (Section [5.2](https://arxiv.org/html/2604.19921#S5.SS2)); and (3) NevIR, an IR task (Section [5.3](https://arxiv.org/html/2604.19921#S5.SS3)).
All three models are first pre-trained on our commonsense corpora: ¬Atomic, ¬Anion, or both. As a baseline, models are pre-trained on existing corpora (Atomic and Anion). We then evaluate on downstream tasks in two settings: (1) fully supervised fine-tuning on the downstream task's training split for encoder-based models, and (2) zero-shot and few-shot in-context learning for LLMs. Following standard practice, we use a zero-shot prompt with the OpenAI o1 model and a few-shot prompt with GPT-4o and open-source LLMs. Pre-trained LLMs are trained and evaluated locally using QLoRA Dettmers et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib15)) due to computational resource limitations. Appendix [C](https://arxiv.org/html/2604.19921#A3) reports experimental details, including the prompts and hyperparameters.
| Model | # Params | RTE-Neg | SNLI-Neg | MNLI-Neg | NevIR |
|---|---|---|---|---|---|
| **Fully supervised** | | | | | |
| BERTNOT Hosseini et al. ([2021](https://arxiv.org/html/2604.19921#bib.bib27)) | 110M | 74.5 | 46.0 | 60.9 | — |
| RoBERTa-large-NSP Rezaei and Blanco ([2025](https://arxiv.org/html/2604.19921#bib.bib28)) | 355M | 87.2 | 56.5 | 69.9 | — |
| stsb-roberta-large Weller et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib26)) | 355M | — | — | — | 24.9 |
| MonoT5 3B Nogueira et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib24)) | 3B | — | — | — | 50.6 |
| RoBERTa-large | 355M | 84.7 | 56.0 | 69.9 | 24.5 |
| Pre-trained with Atomic+Anion | | 86.2 | 56.5 | 69.4 | 29.1 |
| Best of (¬Atomic, ¬Anion) | | 88.1∗ | 58.3 | 69.7 | 34.3∗ |
| **In-context learning (zero-shot)** | | | | | |
| OpenAI o1 | — | 87.5 | 75.9 | 74.6 | 59.7 |
| **In-context learning (few-shot)** | | | | | |
| GPT-4o | — | 86.9 | 74.8 | 75.0 | 61.7 |
| Llama 3.1 70B | 70B | 78.9 | 69.1 | 65.9 | 58.8 |
| Llama 3.1 8B | 8B | 60.0 | 54.4 | 47.0 | 30.6 |
| Pre-trained with Atomic+Anion | | 65.3 | 57.5 | 51.1 | 37.9 |
| Best of (¬Atomic, ¬Anion) | | 81.3∗ | 68.2∗ | 63.9∗ | 42.2∗ |
| Qwen2 7B | 7B | 71.7 | 60.7 | 59.5 | 29.8 |
| Pre-trained with Atomic+Anion | | 78.3 | 66.5 | 63.0 | 33.8 |
| Best of (¬Atomic, ¬Anion) | | 82.3∗ | 72.4∗ | 67.5∗ | 39.8∗ |

Table 4: Results evaluating three NLI benchmarks with negation cues (accuracy) and NevIR (pairwise accuracy) in two settings: (1) fully supervised (top) and (2) in-context learning (bottom). An asterisk (∗) indicates a statistically significant improvement (McNemar's test McNemar ([1947](https://arxiv.org/html/2604.19921#bib.bib35)), p < 0.05) over both the off-the-shelf model and the model trained with existing corpora.

### 5.1 Evaluating with CondaQA
Table [3](https://arxiv.org/html/2604.19921#S4.T3) reports results on CondaQA using two metrics: *accuracy* and *group consistency* Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)). For pre-trained models, we report only the best results among the three corpus configurations. Appendix [D.1](https://arxiv.org/html/2604.19921#A4.SS1) provides the complete results.
RoBERTa-large benefits from pre-training on both existing and our commonsense corpora, with our corpora yielding higher performance. Pre-training on our corpora also outperforms UnifiedQA-v2-large Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)), despite the latter being a larger model (770M vs. 355M). Surprisingly, the proprietary models (OpenAI o1 and GPT-4o) yield worse results than the less powerful open-source Llama 3.1 70B (65.3 and 72.9 vs. 77.5). We hypothesize that OpenAI o1 with a few-shot prompt might achieve higher performance, though at significantly greater token cost. The two LLMs pre-trained on our corpora consistently outperform their base models, with Llama 3.1 8B outperforming Qwen2 7B. Pre-trained Llama 3.1 8B even achieves performance competitive with GPT-4o (71.5 vs. 72.9). More importantly, most improvements from our corpora are statistically significant over both the off-the-shelf baseline and models trained with existing corpora (indicated with ∗ in Table [3](https://arxiv.org/html/2604.19921#S4.T3)). In contrast, pre-training on existing corpora yields worse results than the off-the-shelf baseline. Note that the baseline with existing corpora uses a single configuration: *Valid* triples from Atomic and Anion, and synthesized *Invalid* triples.
Importantly, pre-training on our corpora does not degrade performance on instances without negation. The Affirmative Edit setting in CondaQA (column Aff.) requires the model to reason over passages with negation removed, and we observe no drop in performance, indicating that pre-trained models remain capable of handling affirmative text. Appendix [D.2](https://arxiv.org/html/2604.19921#A4.SS2) provides additional results on CommonsenseQA Talmor et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib18)), a standard commonsense benchmark, further confirming this finding.
| Model | CondaQA | RTE-Neg | SNLI-Neg | MNLI-Neg | NevIR |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 68.7 | 60.0 | 54.4 | 47.0 | 30.6 |
| + ¬Atomic | 70.6 | 81.2 | 68.5 | 63.8 | 38.1 |
| + <¬A, R, B> only | 69.3 | 76.8 | 64.0 | 53.5 | 34.3 |
| + <A, R, ¬B> only | 67.6 | 74.0 | 65.5 | 54.6 | 35.6 |
| + ¬Anion | 71.3 | 81.3 | 68.2 | 63.9 | 42.2 |
| + <¬A, R, ¬B> only | 68.7 | 70.2 | 62.9 | 51.6 | 35.3 |
| Qwen2 7B | 65.1 | 71.7 | 60.7 | 59.5 | 29.8 |
| + ¬Atomic | 69.7 | 82.3 | 72.4 | 67.5 | 39.8 |
| + <¬A, R, B> only | 65.8 | 74.1 | 65.5 | 65.1 | 35.8 |
| + <A, R, ¬B> only | 64.3 | 75.7 | 68.3 | 62.4 | 34.4 |
| + ¬Anion | 69.2 | 78.9 | 66.1 | 62.5 | 36.3 |
| + <¬A, R, ¬B> only | 65.4 | 71.0 | 54.3 | 52.5 | 30.1 |

Table 5: Ablation results (accuracy) on CondaQA, NLI with negation (RTE-Neg, SNLI-Neg, MNLI-Neg), and NevIR, with LLMs pre-trained on different types of negation: the *if* event, the *then* event, or both.
### 5\.2Evaluating with NLI Benchmarks
We further evaluate on three NLI benchmarks with negation: RTE-Neg, SNLI-Neg, and MNLI-Neg. We include additional fully-supervised baselines: BERTNOT (Hosseini et al., [2021](https://arxiv.org/html/2604.19921#bib.bib27)), RoBERTa-large-NSP (Rezaei and Blanco, [2025](https://arxiv.org/html/2604.19921#bib.bib28)), stsb-roberta-large (Weller et al., [2024](https://arxiv.org/html/2604.19921#bib.bib26)), and MonoT5 3B (Nogueira et al., [2020](https://arxiv.org/html/2604.19921#bib.bib24)). Table [4](https://arxiv.org/html/2604.19921#S5.T4) (NLI with Negation) reports results using *accuracy*. Pre-training RoBERTa-large on our corpora yields the best results among all models and outperforms all baselines. However, only one result yields a statistically significant improvement over the off-the-shelf baseline. We hypothesize this is due to potential overfitting: the models are further fine-tuned on each benchmark's training split even though the tasks are simple classification. Moreover, the off-the-shelf models already achieve strong results.
In-context learning with LLMs shows more expected trends. The OpenAI o1 model achieves the highest results overall, followed by GPT-4o. LLMs pre-trained on our corpora demonstrate statistically significant improvements over both the off-the-shelf baseline and models pre-trained on existing corpora across all three benchmarks. Notably, Qwen2 7B even outperforms the larger Llama 3.1 70B model. Unlike on CondaQA, pre-training on existing corpora also yields benefits, despite those corpora being significantly smaller than ours. Note that the existing corpora include negated triples from Anion, which benefit models' negation understanding on the simpler NLI task.
| Model | # Triples | CondaQA | RTE-Neg | SNLI-Neg | MNLI-Neg | NevIR |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | n/a | 68.7 | 60.0 | 54.4 | 47.0 | 30.6 |
| + ¬Atomic + ¬Anion | 1K + 1K | 67.7 | 70.8 | 57.1 | 53.5 | 31.8 |
| + ¬Atomic + ¬Anion | 10K + 10K | 69.3 | 81.0 | 67.7 | 62.9 | 34.5 |
| + ¬Atomic + ¬Anion | 89K + 76K | 71.5 | 81.3 | 67.9 | 63.7 | 38.2 |
| Qwen2 7B | n/a | 65.1 | 71.7 | 60.7 | 59.5 | 29.8 |
| + ¬Atomic + ¬Anion | 1K + 1K | 65.2 | 76.0 | 65.2 | 63.1 | 34.9 |
| + ¬Atomic + ¬Anion | 10K + 10K | 66.6 | 81.7 | 72.2 | 67.4 | 36.0 |
| + ¬Atomic + ¬Anion | 89K + 76K | 66.6 | 82.1 | 72.2 | 67.5 | 38.5 |

Table 6: Ablation on training dataset size (accuracy on CondaQA, NLI with negation, and NevIR). We train Llama 3.1 8B and Qwen2 7B with two subsets of ¬Atomic + ¬Anion: (1) 2K triples (1K from each dataset) and (2) 20K triples (10K each). Training with the 2K subset yields significantly worse performance than the complete dataset, while training with the 20K subset shows substantial improvement and tends to saturate on simpler tasks such as NLI.
### 5\.3Evaluating with NevIR
Following Weller et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib26)), we perform fully-supervised fine-tuning with RoBERTa-large on STS-B (Cer et al., [2017](https://arxiv.org/html/2604.19921#bib.bib39)) instead of NevIR's training split. Table [4](https://arxiv.org/html/2604.19921#S5.T4) (NevIR) reports the results using pairwise accuracy: the model must correctly rank the documents for both queries (with and without negation). Although NevIR is a retrieval task, it only requires ranking two documents for each query and thus essentially becomes binary classification. In fact, RoBERTa-large pre-trained on our corpora outperforms a customized IR model baseline (stsb-roberta-large, Weller et al., [2024](https://arxiv.org/html/2604.19921#bib.bib26); 34.3 vs. 24.9). Again, pre-training on our corpora yields statistically significant improvements over off-the-shelf models and models trained with existing corpora. The results are consistent across all three models. All our models significantly underperform OpenAI o1, GPT-4o, and even MonoT5 3B (Nogueira et al., [2020](https://arxiv.org/html/2604.19921#bib.bib24)). We hypothesize that pre-training a retrieval model with our corpora as a starting point would yield greater benefits.
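Pairwise accuracy can be sketched as follows (the data structures are ours): each NevIR pair contributes a point only when the model ranks the relevant document first for both the original query and its negated counterpart.

```python
def pairwise_accuracy(pairs):
    """NevIR-style pairwise accuracy. Each item holds the model's relevance
    scores for (doc1, doc2) under a query and under its negated variant.
    doc1 is relevant for the original query, doc2 for the negated one;
    a pair counts only if both rankings are correct."""
    correct = 0
    for (s1_q, s2_q), (s1_nq, s2_nq) in pairs:
        if s1_q > s2_q and s2_nq > s1_nq:
            correct += 1
    return correct / len(pairs)

pairs = [
    ((0.9, 0.2), (0.1, 0.8)),  # both queries ranked correctly
    ((0.9, 0.2), (0.7, 0.3)),  # negated query ranked the wrong way
]
print(pairwise_accuracy(pairs))  # 0.5
```

Because both rankings must flip with the negation, per-query ranking accuracy near chance (50%) yields pairwise accuracy near 25%, which is why off-the-shelf scores on NevIR are so low.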
### 5\.4Ablation Studies
#### Pre\-training with Individual Negation Types
We further investigate whether training with individual negation types contributes differently to performance improvements. For ¬Atomic, we compare the complete corpus against training with triples that only negate the *if* event (<¬A, R, B>) or the *then* event (<A, R, ¬B>). For ¬Anion, we compare against training with triples that negate both events (<¬A, R, ¬B>).
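The three negated variants compared above can be illustrated schematically (the `negate` placeholder is purely illustrative; the actual corpora place negation with natural phrasings rather than this crude marker):

```python
def negate(event):
    # Illustrative stand-in for real negation; the corpora negate events
    # with natural phrasings such as "PersonX does not take a picture".
    return f"not [{event}]"

def negated_variants(a, relation, b):
    """Return the three negated triples derived from <A, R, B>:
    <negA, R, B>, <A, R, negB>, and <negA, R, negB>."""
    return [
        (negate(a), relation, b),
        (a, relation, negate(b)),
        (negate(a), relation, negate(b)),
    ]

for triple in negated_variants("PersonX takes a picture", "xWant",
                               "to look at the picture"):
    print(triple)
```

The ablation trains on each of these variant sets in isolation and compares against the complete corpus that mixes all of them.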
Table [5](https://arxiv.org/html/2604.19921#S5.T5) reports the results with the two LLMs. Note that results with the complete corpus are not directly comparable to Tables [3](https://arxiv.org/html/2604.19921#S4.T3) and [4](https://arxiv.org/html/2604.19921#S5.T4), which report only the best configuration. Training with individual negation types consistently underperforms the complete corpus across all benchmarks, demonstrating that the patterns within our corpora are critical (Section [4.2](https://arxiv.org/html/2604.19921#S4.SS2)). Notably, negating the *if* event alone still yields higher results than the off-the-shelf model, suggesting that certain negation types provide partial benefits for negation understanding.
For ¬Anion, training with triples negating both events yields substantially worse performance than the complete corpus. We hypothesize that double negation alone is insufficient, as models benefit from first learning simpler single-negation patterns before generalizing to more complex negations.
#### Pre\-training with Different Data Sizes
Our full training dataset consists of 89K triples from ¬Atomic and 76K triples from ¬Anion (Section [4.2](https://arxiv.org/html/2604.19921#S4.SS2)). We conduct two ablations on the effect of training data size on downstream task performance: (1) training with a 2,000-triple subset (randomly sampled and balanced, with 1,000 triples each from ¬Atomic and ¬Anion), and (2) training with a 20,000-triple subset (10,000 each). Table [6](https://arxiv.org/html/2604.19921#S5.T6) reports the results compared to off-the-shelf models and models trained with the complete dataset. As before, Tables [3](https://arxiv.org/html/2604.19921#S4.T3) and [4](https://arxiv.org/html/2604.19921#S5.T4) only report the best result among ¬Atomic, ¬Anion, or both. Appendix [D](https://arxiv.org/html/2604.19921#A4) reports comparable results with ¬Atomic + ¬Anion.
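The balanced subsets can be drawn with simple random sampling (the corpus contents below are dummy placeholders):

```python
import random

def balanced_subset(atomic_triples, anion_triples, per_corpus, seed=0):
    """Randomly sample an equal number of triples from each corpus,
    mirroring the 1K+1K and 10K+10K ablation splits."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return (rng.sample(atomic_triples, per_corpus)
            + rng.sample(anion_triples, per_corpus))

atomic = [f"atomic-{i}" for i in range(89_000)]
anion  = [f"anion-{i}" for i in range(76_000)]
subset = balanced_subset(atomic, anion, 1_000)
print(len(subset))  # 2000
```

Balancing the two sources keeps the subset's mix of negation patterns comparable to the full dataset, so size is the only variable changing across the ablation.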
Training with only 2K triples already improves over the off-the-shelf models on NLI tasks, though the gains are modest. Scaling to 20K triples yields substantial improvements, with NLI performance approaching that of the full dataset (e.g., RTE-Neg improves from 70.8 to 81.0 for Llama 3.1 8B, compared to 81.3 with the full training set). However, NevIR continues to benefit from additional training data beyond 20K (34.5 → 38.2 for Llama 3.1 8B, 36.0 → 38.5 for Qwen2 7B), suggesting that more complex negation reasoning tasks benefit from larger training sets.
#### Pre\-training with Randomly Labelled Data
We further validate the importance of our LLM judge by training models on randomly labelled data. Training with randomly labelled data substantially degrades performance compared to LLM-validated data, particularly on NevIR, where performance drops well below the off-the-shelf baseline. Due to space limitations, we report the full results and analyses in Appendix [D.3](https://arxiv.org/html/2604.19921#A4.SS3); Appendix [E](https://arxiv.org/html/2604.19921#A5) further provides an error analysis categorizing improvements by negation type (Hossain et al., [2020](https://arxiv.org/html/2604.19921#bib.bib25)) and case studies illustrating specific reasoning patterns improved by our approach.
## 6Conclusion
Negation and commonsense knowledge are both common and important in human language\. Previous work shows that models struggle when negation appears in natural language understanding tasks\. However, few works have investigated commonsense knowledge with negation\.
We present an approach to automatically augment existing commonsense knowledge corpora with negation, contributing over 2M commonsense knowledge triples with negation\. We show that pre\-training models with our corpora is beneficial in understanding negation\. This holds true across three models and five benchmarks\. We further conduct ablation studies and analyses that provide additional evidence and insights into the performance improvements\.
## Limitations
We work on two existing commonsense knowledge corpora with a limited focus: they only contain *if-then* relations. It would be more comprehensive to investigate commonsense knowledge in other forms. In addition, we only consider the logical negation cue *not*. Future work should consider various negation cues, including semantic negation.
We only experiment with two relatively small LLMs \(a 7B and an 8B model\) to study the benefit of our corpora for improving negation understanding\. Due to limited computational resources, we choose not to conduct experiments with larger LLMs \(e\.g\., 70B\)\. Moreover, the LLMs are trained and evaluated with 4\-bit quantization due to the same resource limitation\.
NevIR was developed to evaluate information retrieval tasks involving negation\. We experiment with NevIR as a classification task using either classifier models or general\-purpose LLMs\. Future work should consider models specifically designed for information retrieval\.
## Ethics Statement
#### Data Sources and Collection

We collect the datasets (Atomic, Anion, CondaQA, NLI with Negation, and NevIR) via links provided by the authors.
Data from Atomic (Sap et al., [2019](https://arxiv.org/html/2604.19921#bib.bib4)) is used under the Creative Commons Attribution 4.0 International License; CondaQA (Ravichander et al., [2022](https://arxiv.org/html/2604.19921#bib.bib17)) under the Apache-2.0 License; NLI with Negation (Hossain et al., [2020](https://arxiv.org/html/2604.19921#bib.bib25)) and NevIR (Weller et al., [2024](https://arxiv.org/html/2604.19921#bib.bib26)) under the MIT License. Anion (Jiang et al., [2021](https://arxiv.org/html/2604.19921#bib.bib14)) does not specify a license.
## Acknowledgments
We thank the reviewers for their insightful comments\.
The OpenAI Researcher Access Program provided credits to conduct this research\.
This material is based upon work supported by the National Science Foundation under Grant No\. 2310334\. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF\.
## References
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p1.1)\.
- H. Arnaout, S. Razniewski, G. Weikum, and J. Z. Pan (2022) UnCommonSense: informative negative knowledge about everyday concepts. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM '22, New York, NY, USA, pp. 37–46. External Links: ISBN 9781450392365, [Link](https://doi.org/10.1145/3511808.3557484), [Document](https://dx.doi.org/10.1145/3511808.3557484) Cited by: [§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1).
- R\. Artstein and M\. Poesio \(2008\)Inter\-coder agreement for computational linguistics\.Computational Linguistics34\(4\),pp\. 555–596\.External Links:ISSN 0891\-2017,[Document](https://dx.doi.org/10.1162/coli.07-034-R2),[Link](https://doi.org/10.1162/coli.07-034-R2),https://direct\.mit\.edu/coli/article\-pdf/34/4/555/1808947/coli\.07\-034\-r2\.pdfCited by:[§3\.1](https://arxiv.org/html/2604.19921#S3.SS1.p3.1)\.
- A\. Bosselut, H\. Rashkin, M\. Sap, C\. Malaviya, A\. Celikyilmaz, and Y\. Choi \(2019\)COMET: commonsense transformers for automatic knowledge graph construction\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 4762–4779\.External Links:[Link](https://aclanthology.org/P19-1470/),[Document](https://dx.doi.org/10.18653/v1/P19-1470)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1)\.
- S\. R\. Bowman, G\. Angeli, C\. Potts, and C\. D\. Manning \(2015\)A large annotated corpus for learning natural language inference\.InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,L\. Màrquez, C\. Callison\-Burch, and J\. Su \(Eds\.\),Lisbon, Portugal,pp\. 632–642\.External Links:[Link](https://aclanthology.org/D15-1075/),[Document](https://dx.doi.org/10.18653/v1/D15-1075)Cited by:[§4\.1](https://arxiv.org/html/2604.19921#S4.SS1.p2.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p1.1)\.
- D\. Cer, M\. Diab, E\. Agirre, I\. Lopez\-Gazpio, and L\. Specia \(2017\)SemEval\-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation\.InProceedings of the 11th International Workshop on Semantic Evaluation \(SemEval\-2017\),S\. Bethard, M\. Carpuat, M\. Apidianaki, S\. M\. Mohammad, D\. Cer, and D\. Jurgens \(Eds\.\),Vancouver, Canada,pp\. 1–14\.External Links:[Link](https://aclanthology.org/S17-2001/),[Document](https://dx.doi.org/10.18653/v1/S17-2001)Cited by:[§C\.3](https://arxiv.org/html/2604.19921#A3.SS3.p1.1),[§5\.3](https://arxiv.org/html/2604.19921#S5.SS3.p1.1)\.
- I\. Dagan, O\. Glickman, and B\. Magnini \(2005\)The pascal recognising textual entailment challenge\.InProceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty Visual Object Classification, and Recognizing Textual Entailment,MLCW’05,Berlin, Heidelberg,pp\. 177–190\.External Links:ISBN 3540334270,[Link](https://doi.org/10.1007/11736790_9),[Document](https://dx.doi.org/10.1007/11736790%5F9)Cited by:[§4\.1](https://arxiv.org/html/2604.19921#S4.SS1.p2.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)Qlora: efficient finetuning of quantized llms\.Advances in neural information processing systems36,pp\. 10088–10115\.Cited by:[Appendix B](https://arxiv.org/html/2604.19921#A2.p3.1),[Appendix C](https://arxiv.org/html/2604.19921#A3.p2.1),[§3\.1](https://arxiv.org/html/2604.19921#S3.SS1.p4.1),[§5](https://arxiv.org/html/2604.19921#S5.p2.2)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4171–4186\.External Links:[Link](https://aclanthology.org/N19-1423/),[Document](https://dx.doi.org/10.18653/v1/N19-1423)Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p1.1)\.
- R\. Dobreva and F\. Keller \(2021\)Investigating negation in pre\-trained vision\-and\-language models\.InProceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP,J\. Bastings, Y\. Belinkov, E\. Dupoux, M\. Giulianelli, D\. Hupkes, Y\. Pinter, and H\. Sajjad \(Eds\.\),Punta Cana, Dominican Republic,pp\. 350–362\.External Links:[Link](https://aclanthology.org/2021.blackboxnlp-1.27/),[Document](https://dx.doi.org/10.18653/v1/2021.blackboxnlp-1.27)Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p1.1),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Fang, Q\. V\. Do, H\. Zhang, Y\. Song, G\. Y\. Wong, and S\. See \(2022\)PseudoReasoner: leveraging pseudo labels for commonsense knowledge base population\.InFindings of the Association for Computational Linguistics: EMNLP 2022,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 3379–3394\.External Links:[Link](https://aclanthology.org/2022.findings-emnlp.246/),[Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.246)Cited by:[§3\.1](https://arxiv.org/html/2604.19921#S3.SS1.p3.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[Appendix C](https://arxiv.org/html/2604.19921#A3.p2.1)\.
- J\. Guan, F\. Huang, Z\. Zhao, X\. Zhu, and M\. Huang \(2020\)A knowledge\-enhanced pretraining model for commonsense story generation\.Transactions of the Association for Computational Linguistics8,pp\. 93–108\.External Links:[Link](https://aclanthology.org/2020.tacl-1.7/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00302)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px2.p1.1)\.
- M\. M\. Hossain, V\. Kovatchev, P\. Dutta, T\. Kao, E\. Wei, and E\. Blanco \(2020\)An analysis of natural language inference benchmarks through the lens of negation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 9106–9118\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.732/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.732)Cited by:[§E\.1](https://arxiv.org/html/2604.19921#A5.SS1.SSS0.Px4.p1.1),[§E\.1](https://arxiv.org/html/2604.19921#A5.SS1.p1.1),[§E\.2](https://arxiv.org/html/2604.19921#A5.SS2.p1.1),[Table 20](https://arxiv.org/html/2604.19921#A5.T20),[Table 21](https://arxiv.org/html/2604.19921#A5.T21),[§1](https://arxiv.org/html/2604.19921#S1.p1.1),[§4\.1](https://arxiv.org/html/2604.19921#S4.SS1.p2.1),[§5\.4](https://arxiv.org/html/2604.19921#S5.SS4.SSS0.Px3.p1.1),[Ethics Statement](https://arxiv.org/html/2604.19921#Sx2.p2.1)\.
- A\. Hosseini, S\. Reddy, D\. Bahdanau, R\. D\. Hjelm, A\. Sordoni, and A\. Courville \(2021\)Understanding by understanding not: modeling negation in language models\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),Online,pp\. 1301–1312\.External Links:[Link](https://aclanthology.org/2021.naacl-main.102/),[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.102)Cited by:[Table 17](https://arxiv.org/html/2604.19921#A4.T17.32.36.1),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px3.p1.1),[§5\.2](https://arxiv.org/html/2604.19921#S5.SS2.p1.1),[Table 4](https://arxiv.org/html/2604.19921#S5.T4.16.20.1)\.
- M\. Jang, F\. Mtumbuka, and T\. Lukasiewicz \(2022\)Beyond distributional hypothesis: let language models learn meaning\-text correspondence\.InFindings of the Association for Computational Linguistics: NAACL 2022,M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\),Seattle, United States,pp\. 2030–2042\.External Links:[Link](https://aclanthology.org/2022.findings-naacl.156/),[Document](https://dx.doi.org/10.18653/v1/2022.findings-naacl.156)Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p1.1),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Jiang, A\. Bosselut, C\. Bhagavatula, and Y\. Choi \(2021\)“I‘m not mad”: commonsense implications of negation and contradiction\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),Online,pp\. 4380–4397\.External Links:[Link](https://aclanthology.org/2021.naacl-main.346/),[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.346)Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p3.2),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2604.19921#S3.p1.2),[Ethics Statement](https://arxiv.org/html/2604.19921#Sx2.p2.1)\.
- Y\. K\. Lal, N\. Tandon, T\. Aggarwal, H\. Liu, N\. Chambers, R\. Mooney, and N\. Balasubramanian \(2022\)Using commonsense knowledge to answer why\-questions\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 1204–1219\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.79/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.79)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Li, B\. Jiang, L\. Huang, A\. Beigi, C\. Zhao, Z\. Tan, A\. Bhattacharjee, Y\. Jiang, C\. Chen, T\. Wu,et al\.\(2024\)From generation to judgment: opportunities and challenges of llm\-as\-a\-judge\.arXiv preprint arXiv:2411\.16594\.Cited by:[§3\.1](https://arxiv.org/html/2604.19921#S3.SS1.p2.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)Roberta: a robustly optimized bert pretraining approach\.arXiv preprint arXiv:1907\.11692\.Cited by:[Appendix C](https://arxiv.org/html/2604.19921#A3.p1.1)\.
- Q\. McNemar \(1947\)Note on the sampling error of the difference between correlated proportions or percentages\.Psychometrika12\(2\),pp\. 153–157\.Cited by:[Table 16](https://arxiv.org/html/2604.19921#A4.T16),[Table 3](https://arxiv.org/html/2604.19921#S4.T3),[Table 4](https://arxiv.org/html/2604.19921#S5.T4)\.
- R\. Nogueira, Z\. Jiang, R\. Pradeep, and J\. Lin \(2020\)Document ranking with a pretrained sequence\-to\-sequence model\.InFindings of the Association for Computational Linguistics: EMNLP 2020,T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 708–718\.External Links:[Link](https://aclanthology.org/2020.findings-emnlp.63/),[Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.63)Cited by:[Table 17](https://arxiv.org/html/2604.19921#A4.T17.32.39.1),[§5\.2](https://arxiv.org/html/2604.19921#S5.SS2.p1.1),[§5\.3](https://arxiv.org/html/2604.19921#S5.SS3.p1.1),[Table 4](https://arxiv.org/html/2604.19921#S5.T4.16.23.1)\.
- A\. Ravichander, M\. Gardner, and A\. Marasovic \(2022\)CONDAQA: a contrastive reading comprehension dataset for reasoning about negation\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 8729–8755\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.598/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.598)Cited by:[Table 10](https://arxiv.org/html/2604.19921#A2.T10),[§C\.1](https://arxiv.org/html/2604.19921#A3.SS1.p2.1),[Table 16](https://arxiv.org/html/2604.19921#A4.T16.19.22.1),[Table 16](https://arxiv.org/html/2604.19921#A4.T16.19.32.1),[§4\.1](https://arxiv.org/html/2604.19921#S4.SS1.p1.1),[Table 3](https://arxiv.org/html/2604.19921#S4.T3.9.12.1),[Table 3](https://arxiv.org/html/2604.19921#S4.T3.9.20.1),[§5\.1](https://arxiv.org/html/2604.19921#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2604.19921#S5.SS1.p2.1),[Ethics Statement](https://arxiv.org/html/2604.19921#Sx2.p2.1)\.
- M\. Rezaei and E\. Blanco \(2024\)Paraphrasing in affirmative terms improves negation understanding\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 602–615\.External Links:[Link](https://aclanthology.org/2024.acl-short.55/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-short.55)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Rezaei and E\. Blanco \(2025\)Making language models robust against negation\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 8123–8142\.External Links:[Link](https://aclanthology.org/2025.naacl-long.413/),ISBN 979\-8\-89176\-189\-6Cited by:[Table 17](https://arxiv.org/html/2604.19921#A4.T17.32.37.1),[§5\.2](https://arxiv.org/html/2604.19921#S5.SS2.p1.1),[Table 4](https://arxiv.org/html/2604.19921#S5.T4.16.21.1)\.
- M\. Sap, R\. Le Bras, E\. Allaway, C\. Bhagavatula, N\. Lourie, H\. Rashkin, B\. Roof, N\. A\. Smith, and Y\. Choi \(2019\)Atomic: an atlas of machine commonsense for if\-then reasoning\.InProceedings of the AAAI conference on artificial intelligence,Vol\.33,pp\. 3027–3035\.Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p2.1),[§1](https://arxiv.org/html/2604.19921#S1.p3.2),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2604.19921#S3.p1.2),[Ethics Statement](https://arxiv.org/html/2604.19921#Sx2.p2.1)\.
- R\. Singh, R\. Kumar, and V\. Sridhar \(2023\)NLMs: augmenting negation in language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 13104–13116\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.873/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.873)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px3.p1.1)\.
- R\. Speer, J\. Chin, and C\. Havasi \(2017\)Conceptnet 5\.5: an open multilingual graph of general knowledge\.InProceedings of the AAAI conference on artificial intelligence,Vol\.31\.Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p2.1),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)CommonsenseQA: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4149–4158\.External Links:[Link](https://aclanthology.org/N19-1421/),[Document](https://dx.doi.org/10.18653/v1/N19-1421)Cited by:[§D\.2](https://arxiv.org/html/2604.19921#A4.SS2.p1.2),[§1](https://arxiv.org/html/2604.19921#S1.p2.1),[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2604.19921#S5.SS1.p3.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§1](https://arxiv.org/html/2604.19921#S1.p1.1)\.
- O\. Weller, D\. Lawrie, and B\. Van Durme \(2024\)NevIR: negation in neural information retrieval\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 2274–2287\.External Links:[Link](https://aclanthology.org/2024.eacl-long.139/)Cited by:[§C\.3](https://arxiv.org/html/2604.19921#A3.SS3.p1.1),[Table 17](https://arxiv.org/html/2604.19921#A4.T17.32.38.1),[§4\.1](https://arxiv.org/html/2604.19921#S4.SS1.p3.1),[§5\.2](https://arxiv.org/html/2604.19921#S5.SS2.p1.1),[§5\.3](https://arxiv.org/html/2604.19921#S5.SS3.p1.1),[Table 4](https://arxiv.org/html/2604.19921#S5.T4.16.22.1),[Ethics Statement](https://arxiv.org/html/2604.19921#Sx2.p2.1)\.
- P\. West, C\. Bhagavatula, J\. Hessel, J\. Hwang, L\. Jiang, R\. Le Bras, X\. Lu, S\. Welleck, and Y\. Choi \(2022\)Symbolic knowledge distillation: from general language models to commonsense models\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\),Seattle, United States,pp\. 4602–4625\.External Links:[Link](https://aclanthology.org/2022.naacl-main.341/),[Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.341)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Williams, N\. Nangia, and S\. Bowman \(2018\)A broad\-coverage challenge corpus for sentence understanding through inference\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 1112–1122\.External Links:[Link](https://aclanthology.org/N18-1101/),[Document](https://dx.doi.org/10.18653/v1/N18-1101)Cited by:[§4\.1](https://arxiv.org/html/2604.19921#S4.SS1.p2.1)\.
- A\. Yang, B\. Yang, B\. Hui, B\. Zheng, B\. Yu, C\. Zhou, C\. Li, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Tang, J\. Wang, J\. Yang, J\. Tu, J\. Zhang, J\. Ma, J\. Yang, J\. Xu, J\. Zhou, J\. Bai, J\. He, J\. Lin, K\. Dang, K\. Lu, K\. Chen, K\. Yang, M\. Li, M\. Xue, N\. Ni, P\. Zhang, P\. Wang, R\. Peng, R\. Men, R\. Gao, R\. Lin, S\. Wang, S\. Bai, S\. Tan, T\. Zhu, T\. Li, T\. Liu, W\. Ge, X\. Deng, X\. Zhou, X\. Ren, X\. Zhang, X\. Wei, X\. Ren, X\. Liu, Y\. Fan, Y\. Yao, Y\. Zhang, Y\. Wan, Y\. Chu, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Guo, and Z\. Fan \(2024\)Qwen2 technical report\.External Links:2407\.10671,[Link](https://arxiv.org/abs/2407.10671)Cited by:[Appendix C](https://arxiv.org/html/2604.19921#A3.p2.1)\.
- W\. Zhao, J\. Chiu, J\. Hwang, F\. Brahman, J\. Hessel, S\. Choudhury, Y\. Choi, X\. Li, and A\. Suhr \(2024\)UNcommonsense reasoning: abductive reasoning about uncommon situations\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 8487–8505\.External Links:[Link](https://aclanthology.org/2024.naacl-long.469/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.469)Cited by:[§2](https://arxiv.org/html/2604.19921#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix ADetails for Creating and Annotating the Benchmark
We create the benchmark by sampling 200 triples per relation (9 relations in total) from Atomic's test split. The benchmark includes 7,200 triples, augmenting each original triple with three types of negation (negating the *if* event, the *then* event, or both). We ask two annotators, one graduate student and one with a PhD degree, to validate each triple using three labels: *Valid*, *Invalid*, and *Ambiguous*. Table [7](https://arxiv.org/html/2604.19921#A2.T7) shows the instructions we provide to annotators.
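The benchmark size follows from the sampling plan; a quick arithmetic check (assuming the 7,200 figure counts the original triples together with their three negated variants):

```python
relations = 9
per_relation = 200
originals = per_relation * relations       # triples sampled from Atomic's test split
negation_types = 3                         # negate the if event, the then event, or both
benchmark_size = originals * (1 + negation_types)
print(originals, benchmark_size)  # 1800 7200
```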
## Appendix BDetails for Validating Commonsense Knowledge Triples
We train a task-specific LLM to validate augmented commonsense triples. The model is trained with the same amount of data regardless of the corpus source. Specifically, we sample 1,800 triples per label from the training split of each source corpus (5,400 in total). For *Valid* and *Ambiguous* triples, we sample 100 triples per relation per corpus if the sources are Atomic and Anion; otherwise, we sample 200 triples per relation from the single corpus.
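As a sanity check, the sampling plan above implies the stated totals (assuming Atomic and Anion as the two sources):

```python
relations = 9          # relations shared by the two source corpora
corpora = 2            # Atomic and Anion
labels = ("Valid", "Invalid", "Ambiguous")

# Valid/Ambiguous: 100 triples per relation per corpus
per_label = 100 * relations * corpora
total = per_label * len(labels)
print(per_label, total)  # 1800 5400
```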
For *Invalid* triples, we use GPT-4o to generate *then* events for *if* events sampled from Atomic and Anion. Table [8](https://arxiv.org/html/2604.19921#A2.T8) provides the prompt.
We train the task-specific LLM judge based on Llama 3.1 Instruct 8B and 70B using QLoRA Dettmers et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib15)) with 4-bit quantization; the models are hosted locally on two H100 GPUs with a total of 160 GB of memory. Llama 3.1 8B is trained for 5 epochs with a learning rate of 5e-6 (about 1 hour of training time), while Llama 3.1 70B is trained for 1 epoch with a learning rate of 2e-5 (about 16 hours).
Given a triple ⟨A, R, B⟩ and its three negated variations ⟨¬A, R, B⟩, ⟨A, R, ¬B⟩, and ⟨¬A, R, ¬B⟩, the annotation task is to determine whether each triple aligns with real-world commonsense knowledge. We use the label *Valid* for a triple that always aligns with real-world commonsense knowledge; *Invalid* for a triple that always conflicts with it; and *Ambiguous* for two cases: (1) the interpretation of the triple is ambiguous, meaning it can either align or conflict with real-world commonsense knowledge, or (2) the *if* event is not related to the *then* event by the relation. Below are some examples.

- *Valid*: If PersonX takes a picture, then PersonX wants to look at the picture.
- *Invalid*: If PersonX takes a picture, then PersonX wants to not look at the picture.
- *Ambiguous*: If PersonX takes a picture, then PersonX wants to take a nap.
- *Ambiguous*: If PersonX opens the window, then PersonX wants to breathe.

Table 7: Annotation instructions for the benchmark.

> You are an expert in commonsense reasoning and knowledge generation. Your task is to generate a then event complementing the given if event so that the if-then statement conflicts with commonsense knowledge. These invalid statements should be clearly wrong or illogical based on everyday commonsense knowledge.
> Given the following incomplete if-then statement:
> Statement: If event, then relation …
> Generate the then event within a phrase.
Table 8: Prompt to generate if-then statements that conflict with commonsense knowledge. They are considered *Invalid* triples for the validation task and downstream tasks.

| Relation | Verbalization |
|---|---|
| oEffect | the effect of {object} is |
| oReact | the reaction of {object} is |
| oWant | {object} want |
| xAttr | the attribute of PersonX is |
| xEffect | the effect of PersonX is |
| xIntent | the intention of PersonX is |
| xNeed | PersonX needs |
| xReact | the reaction of PersonX is |
| xWant | PersonX wants |

Table 9: The mapping from commonsense knowledge triples with nine *if-then* relations to natural language statements. {object} indicates the object in the *if* event.

Table [9](https://arxiv.org/html/2604.19921#A2.T9) lists the mapping we use to convert commonsense knowledge triples to natural language *if-then* statements. The models are trained and evaluated with the *if-then* statements instead of the triples.
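The Table 9 mapping can be applied with a small helper; the function name and the explicit `obj` argument for the {object} slot are our own illustration, not part of the paper's code:

```python
# Verbalization templates from Table 9 (relation -> phrase).
VERBALIZATION = {
    "oEffect": "the effect of {object} is",
    "oReact": "the reaction of {object} is",
    "oWant": "{object} want",
    "xAttr": "the attribute of PersonX is",
    "xEffect": "the effect of PersonX is",
    "xIntent": "the intention of PersonX is",
    "xNeed": "PersonX needs",
    "xReact": "the reaction of PersonX is",
    "xWant": "PersonX wants",
}

def verbalize(if_event: str, relation: str, then_event: str, obj: str = "others") -> str:
    """Convert an <if, relation, then> triple to a natural language statement."""
    template = VERBALIZATION[relation].format(object=obj)
    return f"If {if_event}, then {template} {then_event}."

statement = verbalize("PersonX takes a picture", "xWant", "to look at the picture")
```

This reproduces the form of the *Valid* example in Table 7: "If PersonX takes a picture, then PersonX wants to look at the picture."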
> You are a helpful assistant. In this task, you are expected to write answers to questions involving reasoning about negation. The answer to the question should be "yes", "no", "don't know", or a phrase in the passage. Questions can only have one correct answer. Only output [YES], [NO], [DON'T KNOW] or a short phrase in the passage.
> {4 exemplars sampled from the few-shot learning split of CondaQA}

Table 10: 4-shot prompt to evaluate LLMs with CondaQA. We randomly sample 4 exemplars from the few-shot learning split provided by Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)).

| Benchmark | Learning Rate | Batch Size | Epochs |
|---|---|---|---|
| RTE-neg | 2e-5 | 8 | 15 |
| SNLI-neg | 2e-5 | 32 | 3 |
| MNLI-neg | 2e-5 | 64 | 4 |

Table 11: Hyperparameters for fully supervised fine-tuning with three NLI benchmarks: RTE-neg, SNLI-neg, and MNLI-neg.

### B.1 Additional Results on Validating Commonsense Knowledge Triples
Table [12](https://arxiv.org/html/2604.19921#A2.T12) reports additional results for validating commonsense triples with negation, including F1 for each relation and overall accuracy (Acc). Among individual relations, xIntent and xAttr consistently yield higher F1 across models, while oEffect and oReact are more challenging. Fine-tuning with Atomic as the source of *Valid* and *Ambiguous* training data generally outperforms fine-tuning with Anion.
| Model | oEffect | oReact | oWant | xAttr | xEffect | xIntent | xNeed | xReact | xWant | F1 (All) | Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Few-shot learning* | | | | | | | | | | | |
| Llama 3.1 8B | 0.34 | 0.38 | 0.34 | 0.39 | 0.35 | 0.50 | 0.31 | 0.34 | 0.36 | 0.37 | 0.44 |
| Llama 3.1 70B | 0.48 | 0.45 | 0.53 | 0.56 | 0.56 | 0.56 | 0.42 | 0.52 | 0.57 | 0.53 | 0.53 |
| GPT-4o | 0.52 | 0.46 | 0.54 | 0.54 | 0.44 | 0.57 | 0.51 | 0.54 | 0.57 | 0.52 | 0.54 |
| Claude Sonnet 4 | 0.40 | 0.37 | 0.57 | 0.66 | 0.59 | 0.61 | 0.56 | 0.54 | 0.63 | 0.56 | 0.56 |
| *Fine-tuning* | | | | | | | | | | | |
| Llama 3.1 8B w/ Atomic | 0.51 | 0.57 | 0.49 | 0.63 | 0.53 | 0.62 | 0.50 | 0.53 | 0.52 | 0.55 | 0.56 |
| Llama 3.1 8B w/ Anion | 0.52 | 0.56 | 0.50 | 0.59 | 0.51 | 0.59 | 0.45 | 0.56 | 0.50 | 0.53 | 0.55 |
| Llama 3.1 8B w/ Atomic+Anion | 0.38 | 0.44 | 0.46 | 0.57 | 0.47 | 0.56 | 0.48 | 0.51 | 0.47 | 0.48 | 0.51 |
| Llama 3.1 70B w/ Atomic | 0.59 | 0.60 | 0.61 | 0.72 | 0.62 | 0.69 | 0.59 | 0.62 | 0.63 | 0.63 | 0.64 |
| Llama 3.1 70B w/ Anion | 0.54 | 0.54 | 0.56 | 0.68 | 0.51 | 0.60 | 0.43 | 0.50 | 0.55 | 0.55 | 0.56 |
| Llama 3.1 70B w/ Atomic+Anion | 0.56 | 0.58 | 0.61 | 0.71 | 0.60 | 0.66 | 0.57 | 0.62 | 0.63 | 0.62 | 0.62 |

Table 12: Complete results of validating commonsense triples with negation. We report F1 for each relation and overall, along with overall accuracy (Acc). We further report the complete fine-tuning results across variants that differ in whether *Valid* and *Ambiguous* instances come from Atomic, Anion, or both. The results complement Table [1](https://arxiv.org/html/2604.19921#S2.T1) in the main paper.
## Appendix C: Experimental Details for Downstream Tasks
We train three models with either existing corpora or our commonsense knowledge corpora. RoBERTa-large Liu et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib13)) is further fine-tuned with the training split of the specific downstream task, using a batch size of 128, a learning rate of 1e-6, and early stopping with a patience of 3 epochs and a maximum of 5 epochs.
For the two LLMs (Llama 3.1 8B Grattafiori et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib12)) and Qwen2 7B Yang et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib11))), we adopt a standard instruction fine-tuning paradigm, where the training input is formatted as [instruction, verbalized commonsense triple, output label] and the model is trained with the causal language modeling objective. Specifically, the models are trained for 2 epochs with a batch size of 16; Llama 3.1 8B uses a learning rate of 5e-6 and Qwen2 7B uses 2e-5. We also use QLoRA Dettmers et al. ([2023](https://arxiv.org/html/2604.19921#bib.bib15)) with 4-bit quantization for both models. Each model takes approximately 4 hours to train on one H100 GPU with 80 GB of memory. The models are then evaluated using few-shot prompting. GPT-4o is called via OpenAI's API and Claude Sonnet 4 via the AWS Bedrock API.
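The [instruction, verbalized triple, output label] training format can be sketched as follows; the instruction wording and field names are illustrative assumptions, not the paper's exact prompt:

```python
# Sketch of one instruction fine-tuning example. The instruction text below is
# a hypothetical paraphrase of the validation task, not the paper's prompt.
def build_training_example(statement: str, label: str) -> dict:
    """Format a verbalized triple and its label as an instruction-tuning pair."""
    instruction = (
        "Decide whether the following if-then statement aligns with "
        "real-world commonsense knowledge."
    )
    return {
        "input": f"{instruction}\n{statement}",
        "output": label,  # "Valid", "Invalid", or "Ambiguous"
    }

example = build_training_example(
    "If PersonX takes a picture, then PersonX wants to look at the picture.",
    "Valid",
)
```

Under causal language modeling, the loss is computed over the concatenated input and output tokens of each such example.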
### C.1 Experimental Details for CondaQA
For fully\-supervised evaluation, we train LLMs with CondaQA’s training split using a batch size of 8 and a learning rate of 1e\-5\. We use early stopping with a patience of 3 epochs and a maximum of 5 epochs\.
Table [10](https://arxiv.org/html/2604.19921#A2.T10) shows the 4-shot prompt for evaluating LLMs with CondaQA. We reuse the prompts with minimal edits and sample 4 exemplars from the few-shot learning split provided by Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)).
### C.2 Experimental Details for NLI Benchmarks
Table [11](https://arxiv.org/html/2604.19921#A2.T11) reports the hyperparameters for fully supervised fine-tuning of LLMs with the NLI datasets.
Table [13](https://arxiv.org/html/2604.19921#A3.T13) shows the 4-shot prompt used to evaluate LLMs with the RTE-neg dataset, and Table [14](https://arxiv.org/html/2604.19921#A3.T14) shows the 4-shot prompt for the MNLI-neg and SNLI-neg datasets. All exemplars are sampled from the respective training splits.
> You are a helpful assistant. You are given a pair of sentences: a premise and a hypothesis. Your task is to determine the relationship between the two sentences.
> \#\# [entailment]: The premise guarantees the truth of the hypothesis.
> \#\# [not_entailment]: The premise does not guarantee the truth of the hypothesis.
> Format your answer by only outputting [entailment] or [not_entailment].
> \#\# Premise: Edward VIII became King in January of 1936 and abdicated in December.
> \#\# Hypothesis: King Edward VIII abdicated in December 1936.
> \#\# Response: [entailment]
> \#\# Premise: Oil prices fall back as Yukos oil threat lifted.
> \#\# Hypothesis: Oil prices rise.
> \#\# Response: [not_entailment]
> \#\# Premise: World Bank programs have been heavily criticized for many years for resulting in poverty.
> \#\# Hypothesis: The World Bank is criticized for its activities.
> \#\# Response: [entailment]
> \#\# Premise: The cost of the consumer of the United States fell in June.
> \#\# Hypothesis: U.S. consumer spending dived in June.
> \#\# Response: [not_entailment]

Table 13: 4-shot prompt for evaluating LLMs with the RTE-neg dataset. The exemplars are chosen from the training split.

> You are a helpful assistant. You are given a pair of sentences: a premise and a hypothesis. Your task is to determine the relationship between the two sentences.
> \#\# [entailment]: The hypothesis is definitely true given the premise.
> \#\# [contradiction]: The hypothesis is definitely false given the premise.
> \#\# [neutral]: It is not possible to determine whether the hypothesis is true or false just from the premise.
> Format your answer by only outputting [entailment], [contradiction], or [neutral].
> \#\# Premise: One of our number will carry out your instructions minutely.
> \#\# Hypothesis: A member of my team will execute your orders with immense precision.
> \#\# Response: [entailment]
> \#\# Premise: Fun for adults and children.
> \#\# Hypothesis: Fun for only children.
> \#\# Response: [contradiction]
> \#\# Premise: He turned and smiled at Vrenna.
> \#\# Hypothesis: He smiled at Vrenna who was walking slowly behind him with her mother.
> \#\# Response: [neutral]
> \#\# Premise: The famous tenements (or lands) began to be built.
> \#\# Hypothesis: The land remained deserted.
> \#\# Response: [contradiction]

Table 14: 4-shot prompt for evaluating LLMs with the SNLI-neg and MNLI-neg datasets. The exemplars are chosen from the training split.
### C.3 Experimental Details for NevIR
As NevIR does not provide any training data, following Weller et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib26)), we perform fully supervised fine-tuning of LLMs on STS-B Cer et al. ([2017](https://arxiv.org/html/2604.19921#bib.bib39)) using a learning rate of 2e-5 and a batch size of 32 for 4 epochs.
Table [15](https://arxiv.org/html/2604.19921#A3.T15) shows the 4-shot prompt for evaluating LLMs with NevIR.
> You are a helpful assistant. You are given a query and two documents. Your task is to choose the document that has the answer for the query. Output [Doc1] if the first document has the answer for the query, or [Doc2] if the second document has the answer.
> \#\# Query: Which mayor did more vetoing than anticipated?
> \#\# Doc1: In his first year as mayor, Medill received very little legislative resistance from the Chicago City Council. While he vetoed what was an unprecedented eleven City Council ordinances that year, most narrowly were involved with specific financial practices considered wasteful and none of the vetoes were overridden. He used his new powers to appoint the members of the newly constituted Chicago Board of Education and the commissioners of its constituted public library. His appointments were approved unanimously by the City Council.
> \#\# Doc2: In his first year as mayor, Medill received very little legislative resistance from the Chicago City Council. While some expected an unprecedented number of vetoes, in actuality he only vetoed eleven City Council ordinances that year, and most of those were narrowly involved with specific financial practices he considered wasteful and none of the vetoes were overridden. He used his new powers to appoint the members of the newly constituted Chicago Board of Education and the commissioners of its constituted public library. His appointments were approved unanimously by the City Council.
> \#\# Response: [Doc1]
> \#\# Query: Which mayor did less vetoing than anticipated?
> \#\# Doc1: In his first year as mayor, Medill received very little legislative resistance from the Chicago City Council. While he vetoed what was an unprecedented eleven City Council ordinances that year, most narrowly were involved with specific financial practices considered wasteful and none of the vetoes were overridden. He used his new powers to appoint the members of the newly constituted Chicago Board of Education and the commissioners of its constituted public library. His appointments were approved unanimously by the City Council.
> \#\# Doc2: In his first year as mayor, Medill received very little legislative resistance from the Chicago City Council. While some expected an unprecedented number of vetoes, in actuality he only vetoed eleven City Council ordinances that year, and most of those were narrowly involved with specific financial practices he considered wasteful and none of the vetoes were overridden. He used his new powers to appoint the members of the newly constituted Chicago Board of Education and the commissioners of its constituted public library. His appointments were approved unanimously by the City Council.
> \#\# Response: [Doc2]
> \#\# Query: Which Swiss cantons do not have official churches?
> \#\# Doc1: Switzerland has no official state religion, though most of the cantons (except Geneva and Neuchâtel) recognise official churches, which are either the Roman Catholic Church or the Swiss Reformed Church. These churches, and in some cantons also the Old Catholic Church and Jewish congregations, are financed by official taxation of adherents.
> \#\# Doc2: Switzerland has no official state religion, though most of the cantons (except Neuchâtel) recognise official churches, which are either the Roman Catholic Church or the Swiss Reformed Church. These churches, and in some cantons also the Old Catholic Church and Jewish congregations, are financed by official taxation of adherents.
> \#\# Response: [Doc1]
> \#\# Query: Which Swiss canton does not have official churches?
> \#\# Doc1: Switzerland has no official state religion, though most of the cantons (except Geneva and Neuchâtel) recognise official churches, which are either the Roman Catholic Church or the Swiss Reformed Church. These churches, and in some cantons also the Old Catholic Church and Jewish congregations, are financed by official taxation of adherents.
> \#\# Doc2: Switzerland has no official state religion, though most of the cantons (except Neuchâtel) recognise official churches, which are either the Roman Catholic Church or the Swiss Reformed Church. These churches, and in some cantons also the Old Catholic Church and Jewish congregations, are financed by official taxation of adherents.
> \#\# Response: [Doc2]

Table 15: 4-shot prompt to evaluate LLMs with NevIR.
## Appendix D: Additional Results on Downstream Tasks
### D.1 Complete Results for Downstream Tasks
Table [3](https://arxiv.org/html/2604.19921#S4.T3) and Table [4](https://arxiv.org/html/2604.19921#S5.T4) in the main paper report only the best result among the three training corpus configurations (¬Atomic, ¬Anion, or both). Table [16](https://arxiv.org/html/2604.19921#A4.T16) and Table [17](https://arxiv.org/html/2604.19921#A4.T17) provide complete results for all three configurations.
**Fully supervised**

| Model | # Params. | Accuracy (Δ%) | Cons. (All) | Cons. (Par.) | Cons. (Sco.) | Cons. (Aff.) |
|---|---|---|---|---|---|---|
| UnifiedQA-v2-large Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)) | 770M | 66.7 | 30.2 | 64.0 | 43.7 | 46.5 |
| RoBERTa-large | 355M | 64.9 | 29.6 | 61.9 | 41.4 | 45.8 |
| + original triples (Atomic+Anion) | | 67.0 (+3.2) | 32.9 | 65.6 | 46.3 | 48.1 |
| + negated triples (¬Atomic) | | 67.7 (+4.3) | 32.8 | 66.1 | 46.3 | 48.6 |
| + negated triples (¬Anion) | | 67.9 (+4.6)* | 33.3 | 66.0 | 45.8 | 50.1 |
| + negated triples (¬Atomic+¬Anion) | | 68.5 (+5.5)* | 34.3 | 66.4 | 47.6 | 50.1 |

**In-context learning**

| Model | # Params. | Accuracy (Δ%) | Cons. (All) | Cons. (Par.) | Cons. (Sco.) | Cons. (Aff.) |
|---|---|---|---|---|---|---|
| OpenAI o1 (zero-shot) | — | 65.3 | 24.9 | 67.4 | 43.8 | 38.6 |
| InstructGPT + CoT (few-shot) Ravichander et al. ([2022](https://arxiv.org/html/2604.19921#bib.bib17)) | — | 66.3 | 27.3 | 64.2 | 45.1 | 44.9 |
| GPT-4o | — | 72.9 | 34.4 | 78.7 | 52.6 | 47.9 |
| Llama 3.1 70B | 70B | 77.5 | 43.3 | 83.7 | 61.8 | 54.9 |
| Llama 3.1 8B | 8B | 68.7 | 31.5 | 69.0 | 48.1 | 44.4 |
| + original triples (Atomic+Anion) | | 67.6 (−1.6) | 27.5 | 69.5 | 44.3 | 41.4 |
| + negated triples (¬Atomic) | | 70.6 (+2.8) | 32.8 | 71.8 | 48.6 | 45.0 |
| + negated triples (¬Anion) | | 71.3 (+3.8)* | 33.4 | 72.5 | 49.6 | 45.7 |
| + negated triples (¬Atomic+¬Anion) | | 71.5 (+4.1)* | 33.5 | 72.9 | 48.7 | 46.4 |
| Qwen2 7B | 7B | 65.1 | 24.9 | 65.9 | 39.7 | 39.2 |
| + original triples (Atomic+Anion) | | 64.1 (−1.5) | 22.8 | 65.2 | 38.4 | 37.7 |
| + negated triples (¬Atomic) | | 69.7 (+7.1)* | 32.7 | 71.2 | 46.3 | 45.5 |
| + negated triples (¬Anion) | | 69.2 (+6.2)* | 29.9 | 68.8 | 45.4 | 44.1 |
| + negated triples (¬Atomic+¬Anion) | | 66.6 (+2.3) | 28.7 | 68.8 | 43.7 | 40.8 |

Table 16: Complete results on CondaQA in two settings: fully supervised (top) and in-context learning (bottom). We experiment with three training configurations on our corpora. Δ indicates the percent change in accuracy compared to the off-the-shelf model; Cons. columns report group consistency (All, Par., Sco., Aff.). An asterisk (*) indicates a statistically significant improvement (McNemar's test, McNemar ([1947](https://arxiv.org/html/2604.19921#bib.bib35)), p < 0.05) over both the off-the-shelf model and the model trained with existing corpora.

**Fully supervised**

| Model | # Params. | RTE-Neg | SNLI-Neg | MNLI-Neg | NevIR |
|---|---|---|---|---|---|
| BERTNOT Hosseini et al. ([2021](https://arxiv.org/html/2604.19921#bib.bib27)) | 110M | 74.5 | 46.0 | 60.9 | — |
| RoBERTa-large-NSP Rezaei and Blanco ([2025](https://arxiv.org/html/2604.19921#bib.bib28)) | 355M | 87.2 | 56.5 | 69.9 | — |
| stsb-roberta-large Weller et al. ([2024](https://arxiv.org/html/2604.19921#bib.bib26)) | 355M | — | — | — | 24.9 |
| MonoT5 3B Nogueira et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib24)) | 3B | — | — | — | 50.6 |
| RoBERTa-large | 355M | 84.7 | 56.0 | 69.9 | 24.5 |
| + original triples (Atomic+Anion) | | 86.2 | 56.5 | 69.4 | 29.1 |
| + negated triples (¬Atomic) | | 81.5 | 56.6 | 70.9 | 29.8 |
| + negated triples (¬Anion) | | 85.2 | 57.1 | 69.3 | 30.5 |
| + negated triples (¬Atomic+¬Anion) | | 88.1* | 58.3 | 69.7 | 34.3* |

**In-context learning**

| Model | # Params. | RTE-Neg | SNLI-Neg | MNLI-Neg | NevIR |
|---|---|---|---|---|---|
| OpenAI o1 (zero-shot) | — | 87.5 | 75.9 | 74.6 | 59.7 |
| GPT-4o (few-shot) | — | 86.9 | 74.8 | 75.0 | 61.7 |
| Llama 3.1 70B | 70B | 78.9 | 69.1 | 65.9 | 58.8 |
| Llama 3.1 8B | 8B | 60.0 | 54.4 | 47.0 | 30.6 |
| + original triples (Atomic+Anion) | | 65.3 | 57.5 | 51.1 | 37.9 |
| + negated triples (¬Atomic) | | 81.2* | 68.5* | 63.8* | 38.1 |
| + negated triples (¬Anion) | | 81.3* | 68.2* | 63.9* | 42.2* |
| + negated triples (¬Atomic+¬Anion) | | 81.3* | 67.9* | 63.7* | 38.2 |
| Qwen2 7B | 7B | 71.7 | 60.7 | 59.5 | 29.8 |
| + original triples (Atomic+Anion) | | 78.3 | 66.5 | 63.0 | 33.8 |
| + negated triples (¬Atomic) | | 82.3* | 72.4* | 67.5* | 39.8* |
| + negated triples (¬Anion) | | 78.9 | 66.1 | 62.5 | 36.3 |
| + negated triples (¬Atomic+¬Anion) | | 82.1* | 72.2* | 67.5* | 38.5* |

Table 17: Complete results on three NLI benchmarks with negation cues (accuracy) and the NevIR benchmark (pairwise accuracy) in two settings: (1) fully supervised (top) and (2) in-context learning (bottom). We experiment with three training configurations on our corpora. An asterisk (*) indicates a statistically significant improvement over both the off-the-shelf model and the model trained with existing corpora.
### D.2 Evaluation on CommonsenseQA
We evaluate off-the-shelf and pre-trained models on CommonsenseQA Talmor et al. ([2019](https://arxiv.org/html/2604.19921#bib.bib18)), a standard commonsense reasoning benchmark without negation, to verify that pre-training on our negated corpora does not degrade general commonsense reasoning ability. Table [18](https://arxiv.org/html/2604.19921#A4.T18) reports the results. Both models maintain comparable performance after pre-training, with Llama 3.1 8B slightly improving (71.8 → 72.7) and Qwen2 7B showing a minor decrease (80.7 → 79.5). These results demonstrate that pre-training on negated corpora does not lead to catastrophic forgetting of general commonsense knowledge.
| Model | CommonsenseQA |
|---|---|
| Llama 3.1 8B | 71.8 |
| + pre-trained with ¬Atomic+¬Anion | 72.7 |
| Qwen2 7B | 80.7 |
| + pre-trained with ¬Atomic+¬Anion | 79.5 |

Table 18: Results on CommonsenseQA (accuracy), a commonsense reasoning benchmark without negation. Pre-training on our negated corpora does not degrade performance.
### D.3 Ablation on Randomly Labelled Data
Although our LLM judge achieves relatively high precision on validating commonsense triples with negation, the resulting data are still noisy. To quantify the impact of data quality, we train models on a noisier dataset with randomly assigned labels, for which the validation accuracy on *Valid* and *Invalid* triples is approximately 0.50.
Table[19](https://arxiv.org/html/2604.19921#A4.T19)reports the results\. Training with randomly labelled data substantially degrades performance compared to LLM\-validated data across all tasks\. Most notably, NevIR drops well below the off\-the\-shelf baseline for both models \(18\.9 vs\. 30\.6 for Llama 3\.1 8B; 12\.0 vs\. 29\.8 for Qwen2 7B\), indicating that noisy labels can be actively harmful for complex negation reasoning\. For NLI tasks, randomly labelled data still yields some improvement over the baseline \(e\.g\., RTE: 70\.8 vs\. 60\.0 for Llama 3\.1 8B\), suggesting that the model learns partial negation patterns from the triple structure itself\. However, the gains are far smaller than with validated data \(e\.g\., RTE: 70\.8 vs\. 81\.3\)\. These results validate the importance of our LLM judge: even imperfect labels substantially outperform random ones\.
| Model | CondaQA | RTE-Neg | SNLI-Neg | MNLI-Neg | NevIR |
|---|---|---|---|---|---|
| Llama 3.1 8B | 68.7 | 60.0 | 54.4 | 47.0 | 30.6 |
| + randomly labelled ¬Atomic+¬Anion | 67.1 | 70.8 | 57.5 | 50.9 | 18.9 |
| + validated ¬Atomic+¬Anion | 71.5 | 81.3 | 67.9 | 63.7 | 38.2 |
| Qwen2 7B | 65.1 | 71.7 | 60.7 | 59.5 | 29.8 |
| + randomly labelled ¬Atomic+¬Anion | 65.2 | 76.1 | 62.0 | 61.9 | 12.0 |
| + validated ¬Atomic+¬Anion | 66.6 | 82.1 | 72.2 | 67.5 | 38.5 |

Table 19: Ablation on randomly labelled training data. We train Llama 3.1 8B and Qwen2 7B with randomly labelled ¬Atomic+¬Anion. While the randomly labelled dataset yields marginal benefits on CondaQA and the NLI tasks, training with the validated dataset performs significantly better. In addition, randomly labelled data is detrimental to NevIR performance.
## Appendix E: Error Analysis
We conduct a detailed error analysis comparing the off-the-shelf Llama 3.1 8B with the model pre-trained on ¬Atomic+¬Anion across all downstream tasks. Our analysis categorizes the types of negation reasoning errors that are corrected by training with our augmented corpora.
### E.1 Error Types in NLI with Negation
We examine all 940 examples across RTE-Neg, SNLI-Neg, and MNLI-Neg where the pre-trained model answers correctly but the off-the-shelf model does not. Following the negation taxonomy of Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)), we categorize these improvements by negation type in Table [20](https://arxiv.org/html/2604.19921#A5.T20).
| Error Type | Count | Percentage |
|---|---|---|
| Verbal negation in premise only | 323 | 34.4% |
| Verbal negation in hypothesis only | 220 | 23.4% |
| Negation interaction (both P & H) | 388 | 41.3% |
| Affixal negation | 2 | 0.2% |
| Other | 7 | 0.7% |

Table 20: Distribution of error types corrected by pre-training with our commonsense corpora, across 940 improved NLI examples (Llama 3.1 8B). Categories follow the negation taxonomy of Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)).

#### Verbal Negation in Premise Only (34.4%)
In these cases, the premise contains an explicit verbal negation cue (e.g., *not*, *never*, *no*) but the hypothesis does not. The off-the-shelf model often treats the negated premise as if the negation were absent. For example, given the premise "The prosecutor did not tell the court that the incident had caused 'distress' to one of the children." and the hypothesis "The prosecutor told the court that 'distress' in one of the children is associated with the incident.", the off-the-shelf model predicts *entailment*, ignoring "did not" in the premise. The pre-trained model correctly predicts *not_entailment*.
#### Verbal Negation in Hypothesis Only \(23\.4%\)
These examples contain verbal negation in the hypothesis but not in the premise\. The off\-the\-shelf model struggles to determine whether a non\-negated premise supports or contradicts a negated hypothesis\.
#### Negation Interaction \(41\.3%\)
The largest category involves examples where both the premise and hypothesis contain negation, requiring the model to reason about the interaction between multiple negation cues. For instance, given "This growing number of titles does not leave publishing houses with less time and attention to edit and market books." and "Publishing houses cannot give less attention to editing books.", the off-the-shelf model predicts *entailment*. However, the correct label is *neutral*, as the negated premise no longer supports the negated hypothesis. The pre-trained model correctly identifies this relationship.
#### Affixal Negation and Other \(0\.9%\)
A small fraction of improvements involve affixal negation (e.g., *un-*, *dis-*) or other negation types. This low proportion reflects the dominance of verbal negation cues in NLI benchmarks Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)).
### E.2 Analysis by Negation Cue Type in CondaQA
CondaQA annotates each example with a negation cue word. Following the negation taxonomy of Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)), we categorize these cues into five linguistic types and report accuracy for the off-the-shelf and pre-trained Llama 3.1 8B in Table [21](https://arxiv.org/html/2604.19921#A5.T21).
| Category | Examples | Count | Off-the-shelf | Pre-trained |
|---|---|---|---|---|
| Verbal | not, never, no, none | 1,902 | 67.6 | 74.1 |
| Affixal | un-, in-, dis-, -less | 3,553 | 62.1 | 68.0 |
| Implicit | lack, without, prevent | 1,502 | 67.9 | 69.7 |
| Diminisher | rarely, barely, few | 259 | 66.0 | 68.3 |
| Other | | 24 | 79.2 | 75.0 |

Table 21: CondaQA accuracy by negation cue category (Llama 3.1 8B). Categories follow the negation taxonomy of Hossain et al. ([2020](https://arxiv.org/html/2604.19921#bib.bib25)): Verbal: explicit negation words; Affixal: morphological negation (prefixes/suffixes); Implicit: words conveying negation without explicit markers; Diminisher: words that reduce the degree of an assertion.

Pre-training with our corpora yields the largest improvement for verbal negation cues (+6.5 points), which aligns with our training data, where negation is introduced by adding *not* to events. Affixal negation also shows substantial gains (+5.9 points), indicating effective transfer from explicit to morphological negation. Diminisher cues (e.g., *rarely*, *barely*, *few*) improve by +2.3 points, suggesting that learning explicit negation patterns helps models better handle attenuated assertions. Implicit negation cues show a modest improvement (+1.8 points), indicating that implicit negation such as *lack* or *prevent* partially benefits from explicit negation training but may require additional training signals. The Other category (n=24) is too small for reliable estimates.
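A crude keyword-based categorization in the spirit of Table 21 can be sketched as follows; the keyword sets are taken from the table's example column, but the function itself is our illustration (a real classifier would need the full cue inventory, and the naive prefix test will misfire on words such as *increase*):

```python
# Illustrative cue categorization following the Table 21 categories.
CUE_CATEGORIES = {
    "verbal": {"not", "never", "no", "none"},
    "implicit": {"lack", "without", "prevent"},
    "diminisher": {"rarely", "barely", "few"},
}
# Negation affixes from Table 21; a naive prefix check, so words like
# "increase" would be misclassified by this sketch.
AFFIX_PREFIXES = ("un", "in", "dis")

def categorize_cue(cue: str) -> str:
    """Map a negation cue word to one of the five Table 21 categories."""
    cue = cue.lower()
    for category, words in CUE_CATEGORIES.items():
        if cue in words:
            return category
    if cue.startswith(AFFIX_PREFIXES) or cue.endswith("less"):
        return "affixal"
    return "other"
```

For example, `categorize_cue("unmyelinated")` falls into the affixal category, matching the Case 3 cue discussed below in the original paper.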
### E.3 Case Studies
We present representative case studies illustrating specific reasoning patterns improved by our approach\.
#### Case 1: Negation Scope in NLI
> Premise: "A barefoot young girl in a pink gown is *not* asleep on a hard wood floor cuddling her baby doll."
> Hypothesis: "A girl is playing with her doll outside."
> Gold: neutral. Off-the-shelf: contradiction. Pre-trained: neutral.
The off-the-shelf model interprets "not asleep" as contradicting "playing outside", failing to recognize that negating *asleep* does not specify what the girl is doing or where she is. The pre-trained model correctly identifies that the negated premise leaves the hypothesis unresolved.
#### Case 2: Negation and Entailment Direction
> Premise: "Organic fertilizers like vermi compost are *not* used for increasing the quality, fertility and mineral content of the soil."
> Hypothesis: "Organic fertilizers are used as soil enhancers."
> Gold: not_entailment. Off-the-shelf: entailment. Pre-trained: not_entailment.
The off-the-shelf model associates *organic fertilizers* with *soil enhancers* based on world knowledge, completely ignoring the negation in the premise. Pre-training on commonsense triples with negation teaches the model that a negated *if* event does not entail the original *then* event.
#### Case 3: Affixal Negation in CondaQA
> Cue: *unmyelinated*
> Question: "Does the wording of the passage suggest that axons have myelin sheaths while neurons typically do not?"
> Gold: No. Off-the-shelf: Yes. Pre-trained: No.
The prefix *un-* in *unmyelinated* reverses the meaning, and the pre-trained model correctly identifies that the passage does not support the hypothesis.
#### Case 4: Negation in Information Retrieval \(NevIR\)
> Q1: "Whose ship is attacked by *unfamiliar* enemies?"
> Q2: "Whose ship is attacked by *familiar* enemies?"
> Expected: Q1 → Doc1, Q2 → Doc2
NevIR requires the model to distinguish between a query and its negated counterpart. The model must identify which document contains information matching the specific polarity of each query, a task that directly requires understanding whether an event is negated.
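NevIR is scored with pairwise accuracy: as we understand the metric, a query pair counts as correct only if the model matches both the original and the negated query to the right document. A minimal sketch under that assumption:

```python
def pairwise_accuracy(pairs):
    """Fraction of query pairs where BOTH queries were matched correctly.

    Each element of `pairs` is a (q1_correct, q2_correct) boolean tuple for
    one original/negated query pair.
    """
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a and b) / len(pairs)

score = pairwise_accuracy([(True, True), (True, False), (False, False)])
```

Requiring both polarities to be correct is what makes the metric strict: a retriever that ignores negation and returns the same document for both queries scores zero.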