MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning
Summary
MeasHalu is a novel framework for mitigating scientific measurement hallucinations in LLMs through a two-stage reasoning-aware fine-tuning strategy and progressive reward curriculum. It introduces a fine-grained taxonomy of measurement-specific hallucinations and demonstrates improved accuracy on the MeasEval benchmark.
# MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning

Source: [https://arxiv.org/html/2604.16929](https://arxiv.org/html/2604.16929)

Ruijun Huang¹, Zhiqiao Kang¹, Yuxuan Zhu¹, Junxiong Li¹, Jiahao Zhao¹, Minghuan Tan¹\*, Feng Jiang²\*, Min Yang¹

¹ Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

² Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology

\* Corresponding author. Correspondence: mh.tan@siat.ac.cn, jiangfeng@suat-sz.edu.cn

###### Abstract

The accurate extraction of scientific measurements from literature is a critical yet challenging task in AI4Science, enabling large-scale analysis and integration of quantitative research findings. However, Large Language Models (LLMs) frequently exhibit severe hallucinations, which significantly undermine the reliability of automated scientific document understanding systems. To address this problem, we propose MeasHalu, a novel framework for mitigating scientific measurement hallucinations through enhanced reasoning and targeted optimization. We first present a fine-grained taxonomy of measurement-specific hallucinations, categorizing errors across quantities, units, modifiers, and relations. Our approach incorporates a two-stage reasoning-aware fine-tuning strategy using augmented scientific data and process-based supervision. Furthermore, we introduce a progressive reward curriculum designed to penalize specific hallucination types, significantly improving extraction faithfulness. Experimental results demonstrate that MeasHalu substantially reduces hallucination rates and improves overall accuracy on the MeasEval benchmark.
This work provides a targeted solution to a key bottleneck in automated scientific knowledge extraction, facilitating more trustworthy and scalable machine-assisted scientific literature analysis. Our code and data are publicly available at [https://github.com/CAS-SIAT-XinHai/MeasHalu](https://github.com/CAS-SIAT-XinHai/MeasHalu).

Figure 1: Motivation of MeasHalu. To rectify parsing failures, we propose a taxonomy-based approach to mitigate quantity and relation hallucinations.

## 1 Introduction

The rapid expansion of scientific literature has created an unprecedented demand for reliable automatic extraction of quantitative knowledge, which lies at the core of modern AI4Science applications such as large-scale meta-analysis, knowledge base construction, and autonomous scientific discovery (Hanson et al., [2024](https://arxiv.org/html/2604.16929#bib.bib108); Chen et al., [2025](https://arxiv.org/html/2604.16929#bib.bib109)). Central to this process is *scientific measurement extraction*: the task of identifying numerical quantities, their units, modifiers, and their relationships to measured entities and properties.
These quantitative statements form the evidential backbone of experimental sciences across disciplines ranging from materials science to biomedical research (Berrahou et al., [2013](https://arxiv.org/html/2604.16929#bib.bib87); Kononova et al., [2021](https://arxiv.org/html/2604.16929#bib.bib88)). Although recent Large Language Models (LLMs) have demonstrated remarkable generalization abilities, they continue to perform unreliably on this task (Foppiano et al., [2024](https://arxiv.org/html/2604.16929#bib.bib86)): even minor hallucinations in quantities or relations can invalidate entire experimental conclusions, severely limiting the trustworthiness of LLM-driven scientific understanding systems.

A key challenge underlying this failure is that *measurement hallucinations differ fundamentally from general textual hallucinations*. Unlike open-domain factual errors, measurement hallucinations exhibit fine-grained structural failures: models fabricate nonexistent values, misassociate quantities with wrong entities, overlook crucial qualifiers, or distort relations between scientific variables (Saier et al., [2024](https://arxiv.org/html/2604.16929#bib.bib83)). Existing hallucination mitigation techniques, such as retrieval augmentation (Lewis et al., [2020](https://arxiv.org/html/2604.16929#bib.bib111)), generic instruction tuning, or conversational verification (Polak and Morgan, [2024](https://arxiv.org/html/2604.16929#bib.bib112)), remain insufficient, as they are not designed to enforce the strict grounding and structural consistency required by scientific measurements. Yet, despite the importance of this problem, current research lacks both a systematic analysis of measurement-specific hallucination phenomena and targeted learning mechanisms for their mitigation.
For instance, even state-of-the-art LLM-based extraction systems often compromise faithfulness by generating implicit information that is absent from the original text, such as inferring chemical formulas (Dagdelen et al., [2024](https://arxiv.org/html/2604.16929#bib.bib113)).

In this work, we present MeasHalu, a reasoning-enhanced framework that explicitly models and suppresses scientific measurement hallucinations in LLMs. Our central insight is that hallucinations in this domain arise from two intertwined sources: (1) unreliable quantitative reasoning that corrupts individual quantities and units, and (2) fragile long-range relational reasoning that breaks the alignment between quantities, entities, and scientific properties. MeasHalu addresses these failure modes through a unified learning pipeline that combines reasoning-aware supervised fine-tuning with targeted reinforcement learning via structured reward shaping, thereby internalizing scientific grounding constraints directly into model parameters.

Concretely, MeasHalu introduces a fine-grained taxonomy of measurement hallucinations, and leverages this analysis to design a progressive optimization strategy: an initial supervised stage that standardizes quantitative reasoning and extraction structure, followed by Group Relative Policy Optimization (GRPO) with carefully constructed rewards that penalize fabrication, out-of-scope predictions, misclassification, and relational incompleteness. Our framework is developed on top of the MeasEval annotation schema (Harper et al., [2021b](https://arxiv.org/html/2604.16929#bib.bib130)) and integrates external quantity validators, including CQE (Almasian et al., [2023b](https://arxiv.org/html/2604.16929#bib.bib114)) and Quantulum3 ([https://github.com/nielstron/quantulum3](https://github.com/nielstron/quantulum3)), during training.
Extensive experiments on the MeasEval benchmark and our newly constructed MeasEval-Ext dataset demonstrate that MeasHalu substantially reduces hallucination rates and consistently outperforms strong supervised baselines and proprietary LLMs. Furthermore, we show that MeasHalu functions as a reliable external measurement extraction tool that significantly improves performance on downstream embodied scientific tasks, validating its practical utility for trustworthy AI4Science systems.

Our contributions are summarized as follows:

- We provide the first fine-grained analysis of scientific *measurement hallucinations* in large language models, revealing their structural nature and identifying two fundamental sources of failure: unreliable quantitative reasoning and fragile relational grounding.
- We propose MeasHalu, a unified reasoning-enhanced learning framework that systematically suppresses measurement hallucinations by integrating reasoning-aware supervised fine-tuning with targeted reinforcement learning via structured reward shaping.
- We construct a new out-of-distribution evaluation benchmark, MeasEval-Ext, and demonstrate through extensive experiments that MeasHalu substantially reduces hallucination rates and consistently outperforms strong supervised baselines and proprietary LLMs on scientific measurement extraction.
- We further show that MeasHalu serves as a reliable external measurement extraction tool that significantly improves performance on downstream embodied scientific tasks, validating its practical utility for trustworthy AI4Science systems.

## 2 Related Work

### 2.1 Hallucinations in Large Language Models

Hallucination, where language models generate ungrounded or factually incorrect content, has been extensively studied in general-purpose LLMs Huang et al. ([2025](https://arxiv.org/html/2604.16929#bib.bib118)).
Most prior work focuses on semantic and factual hallucinations in open-ended generation Ji et al. ([2023](https://arxiv.org/html/2604.16929#bib.bib116)), with typical taxonomies including fabrication, inconsistency, and logical errors Li et al. ([2025](https://arxiv.org/html/2604.16929#bib.bib117)). However, these taxonomies are largely developed for free-form text generation and do not capture the structural requirements of measurement extraction, where numerical faithfulness, unit consistency, and entity-quantity relational grounding are essential. We address this gap by proposing a fine-grained taxonomy of *measurement-specific hallucinations* and designing mitigation mechanisms tailored to these failure modes.

### 2.2 General Information Extraction vs. Measurement Extraction

Information extraction (IE) and named entity recognition (NER) are foundational NLP tasks Nadeau and Sekine ([2007](https://arxiv.org/html/2604.16929#bib.bib115)). While early systems relied on rule-based and feature-engineered pipelines, modern approaches increasingly leverage neural architectures and pre-trained language models. Nevertheless, *scientific measurement extraction* poses additional constraints beyond conventional IE: models must accurately capture numerical values, units, and modifiers, and preserve their structured relations to measured entities and properties under strict grounding. These constraints make the task particularly sensitive to hallucinations and motivate learning objectives that explicitly penalize fabrication, mis-scoping, and relational incompleteness.

### 2.3 Scientific Measurement Extraction and Benchmarks

Scientific information extraction has been advanced by datasets such as SciERC Luan et al. ([2018](https://arxiv.org/html/2604.16929#bib.bib120)) and MeasEval Harper et al. ([2021a](https://arxiv.org/html/2604.16929#bib.bib119)).
Among them, MeasEval provides the most fine-grained annotation schema for scientific measurements, including quantities, units, modifiers, and their relations, and has become a key benchmark for evaluating measurement extraction systems. Despite progress, numerically grounded and relation-consistent extraction remains challenging, especially for complex sentences containing multiple measurements and implicit constraints Xu et al. ([2024](https://arxiv.org/html/2604.16929#bib.bib126)). Our work builds on the MeasEval schema and targets these persistent failure modes with a hallucination-aware optimization framework.

### 2.4 Mitigation Strategies for LLM Hallucinations

A wide range of techniques have been proposed to reduce hallucinations in LLMs, including retrieval-augmented generation (RAG) Lewis et al. ([2020](https://arxiv.org/html/2604.16929#bib.bib111)), supervised fine-tuning (SFT) Zhou et al. ([2023](https://arxiv.org/html/2604.16929#bib.bib122)), chain-of-thought prompting Wei et al. ([2022](https://arxiv.org/html/2604.16929#bib.bib121)), process-based supervision Lightman et al. ([2023](https://arxiv.org/html/2604.16929#bib.bib125)), reinforcement learning from human feedback (RLHF) Ouyang et al. ([2022](https://arxiv.org/html/2604.16929#bib.bib123)), and direct preference optimization (DPO) Rafailov et al. ([2023](https://arxiv.org/html/2604.16929#bib.bib124)). While effective for open-ended generation, these methods are not explicitly designed to enforce the strict grounding and structural consistency required by scientific measurement extraction. In contrast, our approach integrates reasoning-aware SFT with targeted reinforcement learning and structured reward shaping, explicitly encoding measurement-specific constraints to suppress hallucinations at their structural root.
Despite significant progress in hallucination mitigation, prior work has neither systematically characterized hallucinations in scientific measurement extraction nor introduced specialized reward objectives tailored to its error patterns. We bridge this gap by unifying a fine-grained hallucination taxonomy with a progressive optimization framework designed specifically for measurement-specific error suppression.

## 3 Methodology

Informed by our analysis in Section [2](https://arxiv.org/html/2604.16929#S2), we design MeasHalu around a central hypothesis: *scientific measurement hallucinations arise from two fundamentally different failure modes, namely unreliable quantitative reasoning and fragile relational grounding*. Accordingly, our framework adopts a two-branch mitigation strategy, targeting Quantity Hallucinations and Relation-based Hallucinations respectively. As illustrated in Figure [2](https://arxiv.org/html/2604.16929#S3.F2), MeasHalu integrates progressive supervised fine-tuning with hallucination-aware reinforcement learning, enabling the model to internalize strict scientific grounding constraints directly into its reasoning process. These hallucinations (see Table [10](https://arxiv.org/html/2604.16929#A5.T10)) significantly undermine the reliability of LLMs for this critical task.

Figure 2: Overview of our method consisting of two stages, Supervised Fine-Tuning & GRPO based Reinforcement Learning.

### 3.1 Quantity Hallucination Mitigation

Unlike prior approaches that employ end-to-end joint training for quantities and relations, our method first trains quantity extraction independently. Furthermore, following the SFT stage, we incorporate a GRPO phase specifically driven by hallucination-targeted rewards to further mitigate hallucinations.

#### 3.1.1 Progressive Supervised Fine-tuning

To endow the LLM with structured quantity reasoning capabilities, we adopt a progressive SFT strategy.
Specifically, we first utilize $\mathcal{D}_{\text{aug}}$ to establish foundational quantity reasoning skills, followed by fine-tuning on $\mathcal{D}_{\text{trace}}$ to ensure rigorous alignment with MeasEval standards. The construction details of these two datasets are elaborated below.

##### $\mathcal{D}_{\text{aug}}$

We curate an unlabeled corpus $\mathcal{X}_{\text{un}}$ from arXiv paper abstracts (Cohan et al., [2018](https://arxiv.org/html/2604.16929#bib.bib103)). Lacking gold quantity annotations, we use Quantulum3 ($f_{\text{qtm}}$; https://github.com/nielstron/quantulum3) to extract quantity candidates, then leverage an augmentation template $\mathcal{P}_{\text{aug}}$ to prompt $\mathcal{M}$ to verify these anchors and generate a reasoning trajectory $h_{\text{aug}}$. Formally, for $x \in \mathcal{X}_{\text{un}}$:

$$\tilde{y} \leftarrow f_{\text{qtm}}(x), \quad h_{\text{aug}} \leftarrow \mathcal{M}(x, \tilde{y}; \mathcal{P}_{\text{aug}}) \tag{1}$$

where $\tilde{y}$ denotes the noisy pseudo-labels from $f_{\text{qtm}}$, and $\mathcal{P}_{\text{aug}}$ guides $\mathcal{M}$ to filter false positives via semantics. Valid trajectories form $\mathcal{D}_{\text{aug}} = \{(x, h_{\text{aug}})\}_{i=1}^{20K}$.
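The data flow of Eq. (1) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the regex is a crude stand-in for Quantulum3 ($f_{\text{qtm}}$), `toy_verifier` is a hypothetical rule-based placeholder for prompting $\mathcal{M}$ with $\mathcal{P}_{\text{aug}}$, the trajectory is a placeholder string rather than the paper's trace format, and dropping inputs with no surviving candidate is our own simplification.

```python
import re
from typing import Callable, List, Optional, Tuple

# Crude regex stand-in for f_qtm (the paper uses Quantulum3 at this step).
_CANDIDATE = re.compile(r"\d+(?:\.\d+)?(?:\s?[A-Za-zµ%°]+)?")


def extract_candidates(text: str) -> List[str]:
    """Return noisy quantity candidates (the pseudo-labels y~ in Eq. 1)."""
    return [m.group(0).strip() for m in _CANDIDATE.finditer(text)]


def toy_verifier(text: str, candidates: List[str]) -> Optional[str]:
    """Hypothetical stand-in for M prompted with P_aug.

    Drops candidates that directly follow 'Fig.' or 'Table' and returns a
    placeholder trajectory string, or None when nothing survives.
    """
    kept = [
        c for c in candidates
        if not re.search(r"(?:Fig\.|Table)\s*" + re.escape(c), text)
    ]
    return "verified: " + "; ".join(kept) if kept else None


def build_d_aug(
    corpus: List[str],
    verify: Callable[[str, List[str]], Optional[str]],
) -> List[Tuple[str, str]]:
    """Pair each text x with a verified trajectory h_aug; skip invalid ones."""
    d_aug = []
    for x in corpus:
        y_tilde = extract_candidates(x)   # y~ <- f_qtm(x)
        h_aug = verify(x, y_tilde)        # h_aug <- M(x, y~; P_aug)
        if h_aug is not None:
            d_aug.append((x, h_aug))
    return d_aug


corpus = [
    "As shown in Fig. 2, the film is 70 nm thick.",
    "No measurements are reported here.",
]
d_aug = build_d_aug(corpus, toy_verifier)
```

In this toy run the figure index `2` is proposed by the extractor but rejected by the verifier, which mirrors the role the text assigns to $\mathcal{P}_{\text{aug}}$: filtering false positives that a surface-level parser cannot distinguish from real quantities.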
##### $\mathcal{D}_{\text{trace}}$

We leverage the training split of the MeasEval dataset, which contains human-annotated gold quantity labels $y_{\text{gt}}$, and adopt a *traceback template* $\mathcal{P}_{\text{trace}}$ to guide reasoning reconstruction: given $y_{\text{gt}}$, the model generates a stepwise reasoning trajectory $h_{\text{trace}}$ leading to the gold conclusion, formulated as:

$$h_{\text{trace}} \leftarrow \mathcal{M}(x, y_{\text{gt}}; \mathcal{P}_{\text{trace}}) \tag{2}$$

To ensure correctness of the reasoning trajectory, we enforce strict consistency validation via $\mathbb{I}(\cdot)$, where $\text{concl}(\cdot)$ extracts the final quantity from $h_{\text{trace}}$. The filtered dataset is constructed as:

$$\mathcal{D}_{\text{trace}} = \{(x, h_{\text{trace}}) \mid \mathbb{I}\big(\text{concl}(h_{\text{trace}}), y_{\text{gt}}\big) = 1\} \tag{3}$$

The prompts $\mathcal{P}_{\text{trace}}$ and $\mathcal{P}_{\text{aug}}$ are provided in Appendix [A](https://arxiv.org/html/2604.16929#A1).

#### 3.1.2 Hallucination-targeted Reward Function

The total reward $R(y)$ is a weighted sum of four components targeting distinct quantity-related hallucinations:

$$R = w_1 r_{\text{fmt}} + w_2 r_{\text{scope}} + w_3 r_{\text{fab}} + w_4 r_{\text{mis}} \tag{4}$$

##### Format compliance reward ($r_{\text{fmt}}$):

A binary reward is assigned for strict adherence to the predefined structure <ARABIC\>, …, <CONCLUSION\>, enforcing schema compliance and parsability of generated reasoning chains.

##### Out-of-scope hallucination penalty ($r_{\text{scope}}$):

A penalty is imposed when the model extracts out-of-scope entities, such as figure labels (e.g., “Fig. 1”), that do not constitute valid numerical data. This mechanism utilizes pattern recognition to identify and penalize specific noisy strings, while simultaneously penalizing any generated answers that fail to match the ground truth, ensuring that the model avoids generating arbitrary numbers that deviate from the objective definitions.

##### Fabrication hallucination penalty ($r_{\text{fab}}$):

This penalty targets invalid quantity fabrication by verifying each extracted entity against a hybrid physical parser $\mathcal{T}_{\text{parse}}$. A penalty is triggered if the extracted string fails to be parsed as a valid physical or numerical quantity, preventing the model from inventing nonsensical values.

##### Misclassification hallucination reward ($r_{\text{mis}}$):

A reward is assigned based on token-level precision to mitigate misclassification hallucinations. This mechanism imposes a penalty if the model generates excessively long spans that erroneously incorporate surrounding components, such as the MeasuredEntity, into the quantity extraction.

Detailed mathematical derivations and implementation specifics for these reward components are provided in Appendix [D](https://arxiv.org/html/2604.16929#A4).

### 3.2 Relation-based Hallucination Mitigation

The extraction of relation-based scientific measurements is particularly challenging due to long-range contextual dependencies that frequently induce hallucinations. Compared to traditional rule-based approaches that generate answers after exhaustively processing complex constraints, our approach first pinpoints the quantity-containing sentence, with subsequent reasoning anchored to this local context to extract the quantity and its relations, which eliminates cross-sentence hallucination triggers. We implement this strategy via SFT for schema establishment and GRPO for hallucination-targeted alignment.
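The sentence-pinpointing step can be sketched in a few lines. This is an illustration under stated assumptions, not the learned behaviour of the model: in MeasHalu the localization is produced by the fine-tuned LLM, whereas the hypothetical helpers below use a naive regex sentence splitter and exact substring matching.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by
    whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def locate_evidence(text: str, quantities: List[str]) -> Dict[str, List[str]]:
    """Map each candidate quantity to the sentences that contain it."""
    sentences = split_sentences(text)
    return {q: [s for s in sentences if q in s] for q in quantities}


doc = (
    "The sample was heated to 300 K. "
    "Its thickness was 70 nm after annealing. "
    "No further change was observed."
)
evidence = locate_evidence(doc, ["300 K", "70 nm"])
```

Once each quantity is tied to its own sentence, later reasoning about units, modifiers, and measured entities only needs that local context, which is the property the text credits with removing cross-sentence triggers.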
Our sentence-based strategy also provides efficiency benefits by reducing redundant global reasoning. Detailed analysis is provided in Appendix [F](https://arxiv.org/html/2604.16929#A6).

#### 3.2.1 Quantity-Guided Relation Extraction

Given the document text $x$ and a list of candidate quantities $\mathcal{Q}_{\text{in}}$, the extraction pipeline $A_{\text{halu}}$ follows a two-stage chain-of-thought reasoning process. First, the model identifies the evidence sentences $S = \{s_1, \dots, s_n\}$ that contain quantities in $\mathcal{Q}_{\text{in}}$:

$$S \leftarrow A_{\text{halu}}(x, \mathcal{Q}_{\text{in}}) \tag{5}$$

where $S$ denotes the target sentences. The model then performs fine-grained reasoning over $S$ to resolve quantity attributes (e.g., units and modifiers) and associate them with their corresponding measured entities or properties, yielding the final structured relations $\mathcal{R}$. This two-stage reasoning is learned via supervised fine-tuning on rule-derived traces from MeasEval annotations.

#### 3.2.2 Hallucination-targeted Reward Function

While sentence-based extraction excels at local entity identification, it often struggles to capture long-range dependency chains (e.g., MeasuredEntity, Qualifier), which frequently induces inference bias and leads to the under-extraction of sparse components. To mitigate these reasoning biases and suppress the resulting hallucinations, we design a composite reward function $R$ optimized via GRPO. The reward function is formulated as:

$$R = w_1 r_{\text{fmt}} + w_2 r_{\text{comp}} + w_3 r_{\text{mis}} \tag{6}$$

The design rationale for each reward component is detailed below:

##### Format compliance reward ($r_{\text{fmt}}$)

A composite reward is assigned to enforce strict adherence to the quantitative schema and ensure textual grounding.
It imposes two constraints: first, validating the structural segmentation of reasoning sections to prevent schema collapse; second, verifying that each extracted sentence can be mapped to a valid text span in the source document.

##### Relational completeness reward ($r_{\text{comp}}$)

To mitigate inference-induced and role-definition hallucinations stemming from broken dependency links, this reward is designed to enforce the structural integrity of the reasoning chain. The mechanism drives comprehensive exploration through a two-tier incentive structure: (1) a stepwise reward for incremental component extraction, combined with a weighted exploration term that prioritizes harder-to-predict components; and (2) a completeness bonus awarded only upon full recovery of the gold-standard relation group.

##### Misclassification hallucination reward ($r_{\text{mis}}$)

A reward is assigned based on token-level precision to mitigate misclassification hallucinations. This mechanism imposes a penalty if the model generates excessively long spans that erroneously incorporate surrounding components.

The detailed mathematical formulations for these reward components are provided in Appendix [E](https://arxiv.org/html/2604.16929#A5).

## 4 Experiments

In this section, we evaluate our method on quantity extraction and relation identification using the MeasEval benchmark. Additionally, we introduce MeasEval-Ext, a specialized dataset annotated from recent literature to target novel units and complex expressions absent from the training distribution. Further analyses include an entropy dynamics analysis and an evaluation of MeasHalu as a functional tool within downstream embodied AI tasks.

### 4.1 Effectiveness of Quantity Hallucination Mitigation Strategies

##### Setup

We utilize the Quantity subset of the MeasEval dataset as our primary evaluation benchmark.
To verify the scalability and robustness of our approach, we employ the Qwen2.5-Instruct series across three different scales (0.5B, 3B, and 7B) as the base models.

Table 1: Quantity extraction performance of 0.5B, 3B and 7B models (Mean ± Std).

##### Results

Our full model MeasHalu-Quant achieves consistent performance advantages across all model scales in Table [1](https://arxiv.org/html/2604.16929#S4.T1). Compared to the baseline without GRPO, the integration of GRPO drives state-of-the-art results of 0.749, 0.812, and 0.849 for the 0.5B, 3B, and 7B models, respectively, validating that our rule-based reward system enables stable anti-hallucination alignment even for low-capacity models. Using only gold-standard data (w/o ($\mathcal{D}_{\text{aug}}$ + GRPO)) gives the lowest scores (e.g., 0.346 for 3B), showing that models cannot capture complex multi-domain quantitative annotation rules without prior measurement extraction schema scaffolding. The first stage alone (w/o ($\mathcal{D}_{\text{trace}}$ + GRPO)) brings marginal gains from data scaling but is suboptimal, while combining both SFT stages (w/o GRPO) delivers substantial improvements (e.g., the 3B score rises to 0.539). This confirms that the first stage initializes quantitative schema adherence, while the second stage enhances generalization across diverse scientific contexts by leveraging multi-domain scientific knowledge.

##### Ablation on Reward Components

To further understand the contribution of each reward component, we conduct fine-grained ablation studies by removing individual reward terms from the GRPO objective. The results are shown in Table [2](https://arxiv.org/html/2604.16929#S4.T2).
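To make the ablated terms concrete, the weighted sum of Eq. (4) can be sketched as follows. The exact formulations are in Appendix D and are not reproduced here, so every detail below is an assumption for illustration: the noise patterns, the digit check standing in for the hybrid parser $\mathcal{T}_{\text{parse}}$, the per-prediction averaging, and the unit weights are all hypothetical choices.

```python
import re
from collections import Counter
from typing import Sequence

# Hypothetical noise patterns for out-of-scope strings such as "Fig. 1".
_NOISE = re.compile(r"^(?:Fig\.?|Figure|Table|Eq\.?)\s*\d+", re.IGNORECASE)


def scope_penalty(preds: Sequence[str], gold: Sequence[str]) -> float:
    """r_scope: negative fraction of predictions that look like noise
    or do not match any gold quantity."""
    if not preds:
        return 0.0
    bad = sum(1 for p in preds if _NOISE.match(p) or p not in gold)
    return -bad / len(preds)


def fabrication_penalty(preds: Sequence[str]) -> float:
    """r_fab: negative fraction of predictions that fail a toy parse check
    (a digit must be present; stand-in for the hybrid parser T_parse)."""
    if not preds:
        return 0.0
    bad = sum(1 for p in preds if not re.search(r"\d", p))
    return -bad / len(preds)


def token_precision(preds: Sequence[str], gold: Sequence[str]) -> float:
    """r_mis: share of predicted tokens that also occur in the gold spans,
    so over-long spans are rewarded less."""
    pred_tokens = [t for p in preds for t in p.split()]
    if not pred_tokens:
        return 0.0
    remaining = Counter(t for g in gold for t in g.split())
    hits = 0
    for tok in pred_tokens:
        if remaining[tok] > 0:
            remaining[tok] -= 1
            hits += 1
    return hits / len(pred_tokens)


def total_reward(well_formed: bool, preds: Sequence[str], gold: Sequence[str],
                 w: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum R = w1*r_fmt + w2*r_scope + w3*r_fab + w4*r_mis."""
    r_fmt = 1.0 if well_formed else 0.0  # binary format compliance
    return (w[0] * r_fmt
            + w[1] * scope_penalty(preds, gold)
            + w[2] * fabrication_penalty(preds)
            + w[3] * token_precision(preds, gold))


clean = total_reward(True, ["70 nm"], ["70 nm"])
noisy = total_reward(True, ["Fig. 1", "70 nm film"], ["70 nm"])
```

Under these toy definitions, a prediction set that includes a figure label and an over-long span scores well below an exact match, which is the kind of separation each term is meant to provide; removing a term from the sum corresponds to one row of the ablation.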
Removing either the out-of-scope penalty ($r_{\text{scope}}$) or the fabrication penalty ($r_{\text{fab}}$) leads to consistent performance degradation across all model scales, indicating that both are critical for preventing invalid or noisy extractions and suppressing hallucinated quantities. These results confirm that different reward components address complementary failure modes, and their combination is necessary to achieve robust hallucination mitigation.

Table 2: Ablation study of reward components for quantity hallucination mitigation (Mean ± Std).

### 4.2 Effectiveness of Relation-based Hallucination Mitigation Strategies

In this section, we evaluate our relation-based hallucination mitigation method on the MeasEval dataset and MeasEval-Ext, a newly annotated dataset containing novel expressions absent from MeasEval, designed to assess the model’s generalization and robustness.

##### Setup

We use the Qwen2.5-Instruct models (0.5B, 3B, 7B) to assess performance across parameter scales. Although MeasEval is a high-quality benchmark, its limited size and dated sources underrepresent emerging units. To evaluate robustness under distribution shift, we introduce MeasEval-Ext, annotated strictly following the MeasEval schema. We employ an adversarial strategy by selecting recent literature containing novel units and complex expressions absent from the training distribution, rigorously testing model generalization beyond memorized vocabulary. Annotations followed MeasEval guidelines (see Appendix [C](https://arxiv.org/html/2604.16929#A3) for agreement analysis).

Table 3: Experimental results over the MeasEval Benchmark. Comparing MeasHalu with competition leaders and rule/sentence-based LLM baselines. Top ranks are shaded orange (1st), yellow (2nd), and teal (3rd).

Table 4: Experimental results over the MeasEval-Ext.
##### Results over MeasEval

Table [3](https://arxiv.org/html/2604.16929#S4.T3) compares complex quantitative relation extraction on the MeasEval test set. MeasHalu-7B achieves an overall F1 of 0.512, closely matching the competition winner LIORI (Davletov et al., [2021](https://arxiv.org/html/2604.16929#bib.bib131)) (0.519; LIORI uses a six-model ensemble and does not release weights), and substantially outperforming other supervised baselines such as CONNER (Cao et al., [2021](https://arxiv.org/html/2604.16929#bib.bib132)) (0.473) and Counts (Gangwar et al., [2021](https://arxiv.org/html/2604.16929#bib.bib133)) (0.432).

Table 5: Ablation study of reward components for relation hallucination mitigation across different model scales (Mean ± Std).

Our model also surpasses state-of-the-art proprietary LLMs (e.g., GPT-5, Gemini-2.5-Pro). Even with optimized sentence-based prompting, GPT-5 reaches only 0.406 F1, leaving MeasHalu-7B ahead by over 10 points. This result highlights the necessity of our quantitative domain alignment pipeline (SFT + composite reward optimization) for mitigating relational quantity hallucinations. Across all baseline LLMs, sentence-based prompting consistently outperforms rule-based prompting (e.g., Gemini-2.5-Pro improves from 0.359 to 0.440), supporting our hypothesis that sentence-level localized reasoning is more effective than rigid global rule-based deduction for complex quantitative relation extraction.

As shown in Table [4](https://arxiv.org/html/2604.16929#S4.T4), the results on MeasEval-Ext expose a significant performance gap: while general LLMs exhibit non-uniform shifts and often struggle with novel expressions, MeasHalu demonstrates robust generalization to unseen distributions, substantially widening its lead over all baselines.
Detailed statistics with standard deviations can be found in Table [13](https://arxiv.org/html/2604.16929#A8.T13) and Table [14](https://arxiv.org/html/2604.16929#A8.T14).

##### Ablation on Reward Components

To further investigate the contribution of each reward component for relation-based hallucination mitigation, we conduct fine-grained ablation studies by removing individual reward terms from the GRPO objective. The results are presented in Table [5](https://arxiv.org/html/2604.16929#S4.T5). Across all model scales (0.5B, 3B, and 7B), removing any reward component consistently leads to performance degradation, indicating that each term plays an essential role in maintaining precise span boundaries, preventing over-generation, and preserving relational consistency, especially for sparse components such as Qualifier and Qualifies. Overall, the full MeasHalu models achieve the best performance across all metrics, while ablated variants exhibit reduced robustness in either quantity prediction or relational consistency. These results confirm that the reward components address complementary aspects of relation hallucination and are jointly necessary for stable extraction. Additional experiments on cross-domain generalization are provided in Appendix [G](https://arxiv.org/html/2604.16929#A7).

### 4.3 Further Analysis

##### Mechanism of Hallucination Suppression via Entropy Dynamics

Inspired by Cui and Ding ([2025](https://arxiv.org/html/2604.16929#bib.bib110)), we quantify Cognitive Hesitation via entropy dynamics, adapting the analysis to our task by distinguishing the quantity group (Quantity, Unit, Modifier) from the relation group (MeasuredEntity, MeasuredProperty, Qualifier). We focus on tokens strictly bounded by square brackets (e.g., parsing 70 m from the tagged sequence ...surface form \[70 m\]...).
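A minimal sketch of how per-token entropy inside bracketed spans might be collected is given below. It rests on assumptions that the text does not specify: entropy is taken as Shannon entropy in nats over each token's predictive distribution, the brackets are assumed to be standalone tokens, and `bracket_entropies` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def bracket_entropies(tokens: Sequence[str],
                      dists: Sequence[Sequence[float]]) -> List[List[float]]:
    """Collect entropies of tokens strictly between '[' and ']'.

    Returns one list of entropies per bracketed span, in order of appearance.
    """
    spans: List[List[float]] = []
    current = None  # None means we are outside any bracket
    for tok, dist in zip(tokens, dists):
        if tok == "[":
            current = []
        elif tok == "]":
            if current is not None:
                spans.append(current)
            current = None
        elif current is not None:
            current.append(token_entropy(dist))
    return spans


# Toy sequence: "... surface form [ 70 m ]" with made-up distributions.
tokens = ["surface", "form", "[", "70", "m", "]"]
dists = [[0.5, 0.5], [1.0], [1.0], [1.0], [0.5, 0.5], [1.0]]
spans = bracket_entropies(tokens, dists)
```

Aggregates over such per-bracket lists (means, spreads, or counts of unusually high values) can then be computed per semantic role, keeping the analysis focused on the extracted spans rather than on the surrounding reasoning text.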
To capture micro-level certainty, we report four key statistics: Bracket Entropy Mean ($H_B$) and Std ($\sigma_B$) measure the average confidence level and its spread; Spike Rate ($R_B$) measures the proportion of brackets containing high-entropy tokens; and Sample Spike Ratio ($R_{sample}$) quantifies the proportion of samples containing at least one high-risk fluctuation.

Table 6: Fine-grained entropy statistics by semantic role.

Table [6](https://arxiv.org/html/2604.16929#S4.T6) reveals a clear dichotomy in hallucination suppression across semantic roles. 1) Quantity Group Stability: SFT is already near-deterministic ($H \approx 0.0071$). GRPO further compresses residual uncertainty ($H \approx 0.0034$) with negligible spike fluctuations ($R_{sample} \approx 1.54\%$). 2) Relation Group Sharpening: GRPO reduces the spike ratio from 33.85% to 14.62% and lowers mean entropy by 42.3%. These results indicate that relational reasoning, which is highly ambiguous under SFT, becomes substantially more stable under GRPO. We attribute this improvement to the directed collapse induced by GRPO, which truncates long-tail uncertainty and enforces convergence toward deterministic facts. To illustrate the effect of GRPO training on reasoning stability more intuitively, we also select high-entropy points in the reasoning process for a case study. Details are in Appendix [B](https://arxiv.org/html/2604.16929#A2).

##### Application for Embodied AI Tasks

To validate the practical value of our fine-grained extraction for embodied AI, we adapt OpenExp Liu et al. ([2024](https://arxiv.org/html/2604.16929#bib.bib134)) to a text-to-action generation task, where models generate executable chemical action sequences (e.g., ADD … (100 mg)) from unstructured experimental text, mimicking real-world automated laboratory scenarios.
We construct OpenExp-Action-100, a dataset of 100 diverse instances, by using unstructured experimental narratives as inputs and OpenExp’s linearized action sequences as gold-standard outputs. To enable controlled comparison, we further define three experimental settings: Baseline (no augmentation), Gemini-Aug (quantity relations extracted by Gemini), and MeasEval-Aug (quantity relations extracted by MeasHalu).

Table 7: Performance on OpenExp-Action-100 with MeasEval-formatted quantity–relation context from different sources (MeasHalu vs. Gemini). Best scores per model are in bold.

Table [7](https://arxiv.org/html/2604.16929#S4.T7) shows that injecting structured quantity relations significantly improves Structural Validity (Val, executable/logical consistency), with MeasEval-Aug (82.3%) outperforming Gemini-Aug and Baseline. The modest BLEU improvement (19.8 vs. 16.3) stems from gold-standard granularity mismatch. Specifically, OpenExp’s minimalist annotations omit critical details (e.g., "anhydrous") that our extraction retains. For embodied AI, structural validity, rather than textual overlap, is pivotal; MeasEval-formatted extraction ensures this validity by capturing critical details, providing constraints for executable instructions and practical utility for perception-to-action pipelines. Table [15](https://arxiv.org/html/2604.16929#A8.T15) in the Appendix shows the full table with standard deviations.

## 5 Conclusion

The proposal of MeasHalu marks a significant step forward in systematically characterizing measurement hallucinations in large language models within the scientific extraction domain. Experimental results on both in-distribution and newly annotated out-of-distribution benchmarks (MeasEval-Ext) show that MeasHalu substantially improves robustness and consistently outperforms strong supervised baselines and state-of-the-art large language models.
Ultimately, MeasHalu proves to be a reliable external tool that drives significant gains in downstream applications, validating its utility for embodied AI and AI4Science.

## Limitations

Despite advances achieved in this paper, MeasHalu has notable limitations. First, even though MeasHalu outperforms all existing baselines, the extraction performance for sparse components (e.g., qualifiers, F1 = 0.170) remains low, hindered by limited annotated data and ambiguous semantic dependencies in scientific text. Second, the framework’s generalization to low-resource languages or domain-specific jargon (e.g., niche engineering units) is untested, as current training data focuses on English scientific literature. Third, processing ultra-long documents with nested measurement relations may introduce computational inefficiencies, as the sentence-based reasoning strategy requires contextual localization for each quantity.

## Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (62406314), the China Postdoctoral Science Foundation (2023M733654), and the Guangdong Basic and Applied Basic Research Foundation (2023A1515110496).

## References

- S. Almasian, V. Kazakova, P. Göldner, and M. Gertz (2023a) CQE: a comprehensive quantity extractor. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 12845–12859. External Links: [Link](https://aclanthology.org/2023.emnlp-main.793/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.793) Cited by: [item $r_{\text{fab}}$](https://arxiv.org/html/2604.16929#A4.I1.ix3.p1.1).
- S. Almasian, V. Kazakova, P. Göldner, and M. Gertz (2023b) CQE: a comprehensive quantity extractor. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12845–12859. Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p4.1).
- S. L. Berrahou, P. Buche, J. Dibie-Barthelemy, and M. Roche (2013) How to extract unit of measure in scientific documents? In Special Session on Text Mining, Vol. 2, pp. 249–256. Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p1.1).
- J. Cao, Y. Xiang, Y. Zhang, Z. Qi, X. Chen, and Y. Zheng (2021) CONNER: a cascade count and measurement extraction tool for scientific discourse. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, and X. Zhu (Eds.), Online, pp. 1239–1244. External Links: [Link](https://aclanthology.org/2021.semeval-1.176/), [Document](https://dx.doi.org/10.18653/v1/2021.semeval-1.176) Cited by: [§4.2](https://arxiv.org/html/2604.16929#S4.SS2.SSS0.Px2.p1.1).
- Q. Chen, M. Yang, L. Qin, J. Liu, Z. Yan, J. Guan, D. Peng, Y. Ji, H. Li, M. Hu, Y. Zhang, Y. Liang, Y. Zhou, J. Wang, Z. Chen, and W. Che (2025) AI4Research: a survey of artificial intelligence for scientific research. External Links: 2507.01903, [Link](https://arxiv.org/abs/2507.01903) Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p1.1).
- A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian (2018) A discourse-aware attention model for abstractive summarization of long documents. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). External Links: [Link](http://dx.doi.org/10.18653/v1/n18-2097), [Document](https://dx.doi.org/10.18653/v1/n18-2097) Cited by: [§3.1.1](https://arxiv.org/html/2604.16929#S3.SS1.SSS1.Px1.p1.6).
- G. Cui and N. Ding (2025) The entropy mechanism of reinforcement learning for reasoning language models. Computing Magazine of the CCF 1(7), pp. 26–33. Cited by: [§4.3](https://arxiv.org/html/2604.16929#S4.SS3.SSS0.Px1.p1.1).
- J. Dagdelen, A. Dunn, S. Lee, N. Walker, A. S. Rosen, G. Ceder, K. A. Persson, and A. Jain (2024) Structured information extraction from scientific text with large language models. Nature Communications 15(1), pp. 1418. Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p2.1).
- M. H. Daniel Han and U. team (2023) Unsloth. External Links: [Link](https://github.com/unslothai/unsloth) Cited by: [§I.2](https://arxiv.org/html/2604.16929#A9.SS2.p1.1).
- A. Davletov, D. Gordeev, N. Arefyev, and E. Davletov (2021) LIORI at SemEval-2021 task 8: ask transformer for measurements. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, and X. Zhu (Eds.), Online, pp. 1249–1254. External Links: [Link](https://aclanthology.org/2021.semeval-1.178/), [Document](https://dx.doi.org/10.18653/v1/2021.semeval-1.178) Cited by: [§4.2](https://arxiv.org/html/2604.16929#S4.SS2.SSS0.Px2.p1.1).
- L. Foppiano, G. Lambard, T. Amagasa, and M. Ishii (2024) Mining experimental data from materials science literature with large language models: an evaluation study. Science and Technology of Advanced Materials: Methods 4(1), pp. 2356506. Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p1.1).
- A. Gangwar, S. Jain, S. Sourav, and A. Modi (2021) Counts@IITK at SemEval-2021 task 8: SciBERT based entity and semantic relation extraction for scientific data. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, and X. Zhu (Eds.), Online, pp. 1232–1238. External Links: [Link](https://aclanthology.org/2021.semeval-1.175/), [Document](https://dx.doi.org/10.18653/v1/2021.semeval-1.175) Cited by: [§4.2](https://arxiv.org/html/2604.16929#S4.SS2.SSS0.Px2.p1.1).
- M. A. Hanson, P. G. Barreiro, P. Crosetto, and D. Brockington (2024) The strain on scientific publishing. Quantitative Science Studies 5(4), pp. 823–843. External Links: ISSN 2641-3337, [Document](https://dx.doi.org/10.1162/qss%5Fa%5F00327), [Link](https://doi.org/10.1162/qss_a_00327), https://direct.mit.edu/qss/article-pdf/5/4/823/2478590/qss_a_00327.pdf Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p1.1).
- C. Harper, J. Cox, C. Kohler, A. Scerri, R. Daniel Jr, and P. Groth (2021a) SemEval-2021 task 8: measeval–extracting counts and measurements and their related contexts. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pp. 306–316. Cited by: [§2.3](https://arxiv.org/html/2604.16929#S2.SS3.p1.1).
- C. Harper, J. Cox, C. Kohler, A. Scerri, R. Daniel Jr., and P. Groth (2021b) SemEval-2021 task 8: MeasEval – extracting counts and measurements and their related contexts. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, and X. Zhu (Eds.), Online, pp. 306–316. External Links: [Link](https://aclanthology.org/2021.semeval-1.38/), [Document](https://dx.doi.org/10.18653/v1/2021.semeval-1.38) Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p4.1).
- J. He, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, Z. Afzal, Z. Zhai, B. Fang, H. Yoshikawa, et al. (2021) Chemu 2020: natural language processing methods are effective for information extraction from chemical patents. Frontiers in Research Metrics and Analytics 6, pp. 654438. Cited by: [Appendix G](https://arxiv.org/html/2604.16929#A7.p1.1).
- L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. (2025) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), pp. 1–55. Cited by: [§2.1](https://arxiv.org/html/2604.16929#S2.SS1.p1.1).
- Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), pp. 1–38. Cited by: [§2.1](https://arxiv.org/html/2604.16929#S2.SS1.p1.1).
- O. Kononova, T. He, H. Huo, A. Trewartha, E. A. Olivetti, and G. Ceder (2021) Opportunities and challenges of text mining in materials research. iScience 24(3). External Links: ISSN 2589-0042, [Document](https://dx.doi.org/10.1016/j.isci.2021.102155), [Link](https://doi.org/10.1016/j.isci.2021.102155) Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p1.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33, pp. 9459–9474. Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p2.1), [§2.4](https://arxiv.org/html/2604.16929#S2.SS4.p1.1).
- C. Li, P. Wang, C. Wang, L. Zhang, Z. Liu, Q. Ye, Y. Xu, F. Huang, X. Zhang, and P. S. Yu (2025) Loki’s dance of illusions: a comprehensive survey of hallucination in large language models. arXiv preprint arXiv:2507.02870. Cited by: [§2.1](https://arxiv.org/html/2604.16929#S2.SS1.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations. Cited by: [§2.4](https://arxiv.org/html/2604.16929#S2.SS4.p1.1).
- Z. Liu, Y. Shi, A. Zhang, S. Li, E. Zhang, X. Wang, K. Kawaguchi, and T. Chua (2024) ReactXT: understanding molecular “reaction-ship” via reaction-contextualized molecule-text pretraining. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 5353–5377. External Links: [Link](https://aclanthology.org/2024.findings-acl.318/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.318) Cited by: [§4.3](https://arxiv.org/html/2604.16929#S4.SS3.SSS0.Px2.p1.1).
- Y. Luan, L. He, M. Ostendorf, and H. Hajishirzi (2018) Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. arXiv preprint arXiv:1808.09602. Cited by: [§2.3](https://arxiv.org/html/2604.16929#S2.SS3.p1.1).
- D. Nadeau and S. Sekine (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), pp. 3–26. Cited by: [§2.2](https://arxiv.org/html/2604.16929#S2.SS2.p1.1).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. Cited by: [§2.4](https://arxiv.org/html/2604.16929#S2.SS4.p1.1).
- M. P. Polak and D. Morgan (2024) Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nature Communications 15(1), pp. 1569. Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p2.1).
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741. Cited by: [§2.4](https://arxiv.org/html/2604.16929#S2.SS4.p1.1).
- T. Saier, M. Ohta, T. Asakura, and M. Färber (2024) HyperPIE: hyperparameter information extraction from scientific publications. In Advances in Information Retrieval, N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, and I. Ounis (Eds.), Cham, pp. 254–269. External Links: ISBN 978-3-031-56060-6 Cited by: [§1](https://arxiv.org/html/2604.16929#S1.p2.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837. Cited by: [§2.4](https://arxiv.org/html/2604.16929#S2.SS4.p1.1).
- A. Xu, M. Tan, L. Wang, M. Yang, and R. Xu (2024) NUMCoT: numerals and units of measurement in chain-of-thought reasoning using large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 14268–14290. Cited by: [§2.3](https://arxiv.org/html/2604.16929#S2.SS3.p1.1).
- Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma (2024) LlamaFactory: unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. External Links: [Link](http://arxiv.org/abs/2403.13372) Cited by: [§I.1](https://arxiv.org/html/2604.16929#A9.SS1.p1.1).
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. (2023) Lima: less is more for alignment. Advances in Neural Information Processing Systems 36, pp. 55006–55021. Cited by: [§2.4](https://arxiv.org/html/2604.16929#S2.SS4.p1.1).

## Appendix A Prompt template

Prompt for $\mathcal{P}_{\text{trace}}$

> Instruction: You are an expert in extracting structured annotations from text. I have an text input and you need to extract all the quantities within it. I need you to strictly follow the format with six specific sections: ARABIC-QUANTITY, NUMERIC-QUANTITY, TIME-QUANTITY, CHANGE-QUANTITY, CHANGE-QUANTITY, FORMULA-QUANTITY, CONCLUSION.
>
> To explain further: In ARABIC-QUANTITY, outline a step-by-step thought process you use to extract quantity in arabic form. In NUMERIC-QUANTITY, outline a step by step thought process … In CONCLUSION, give the final answer in a tsv format explained below.
>
> I will provide you with the quantities extracted using the quantulum library for your reference, the information provided by Quantulum is standardized. You need to find the original text in the passage and fill in the tsv form. Also, the quantulum information maybe incorrect, You can’t follow it completely.
>
> Here’s how the format should look: \<ARABIC-QUANTITY\> \[Provide a chain-of-thought explanation of how you extract all quantities in the arabic forms\] \</ARABIC-QUANTITY\> \<NUMERIC-QUANTITY\> … \<CONCLUSION\> \[State the final answer in a tsv format explained below format…\] \</CONCLUSION\>
>
> Task Definition: Extract Quantities
>
> 1. Annotation of Quantities: …
> 2. Example Process: …
>
> Output Format (TSV Fields): …
>
> Final Output Example: …
>
> The reference answer from quantulum: …

Prompt for $\mathcal{P}_{\text{aug}}$

> Instruction: You are an expert in extracting structured annotations from text. I have an text input and you need to extract all the quantities within it. I need you to strictly follow the format with six specific sections: ARABIC-QUANTITY, NUMERIC-QUANTITY, TIME-QUANTITY, CHANGE-QUANTITY, CHANGE-QUANTITY, FORMULA-QUANTITY, CONCLUSION.
>
> To explain further: In ARABIC-QUANTITY, outline a step-by-step thought process you use to extract quantity in arabic form. In NUMERIC-QUANTITY, outline a step by step thought process … In CONCLUSION, give the final answer in a tsv format explained below. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the CONCLUSION matches the standard correct answer precisely.
>
> Here’s how the format should look: \<ARABIC-QUANTITY\> \[Provide a chain-of-thought explanation of how you extract all quantities in the arabic forms\] \</ARABIC-QUANTITY\> \<NUMERIC-QUANTITY\> … \<CONCLUSION\> \[State the final answer in a tsv format explained below format…It is crucial that you adhere to this structure exactly as outlined and that the final answer\] \</CONCLUSION\>
>
> Task Definition: Extract Quantities
>
> 1. Annotation of Quantities: …
> 2. Example Process: …
>
> Output Format (TSV Fields): …
>
> Final Output Example: …
>
> The gold answers: …

## Appendix B Case Study: High-Entropy Token Suppression by GRPO

To better understand the effect of GRPO, we examine a representative sample:

##### Input:

> Samples were then annealed in air in a pre-heated furnace at temperatures up to 798 °C for times chosen to ensure complete iron diffusion through the sample.

##### Gold Standard Annotation:

- Quantity: up to 798 °C
- Unit: °C
- Modifiers: \["IsRange"\]
- MeasuredProperty: annealed in air in a pre-heated furnace at temperatures
- MeasuredEntity: Samples

##### SFT Output:

> We can find the quantity with surface form \[up to 798 °C\], it has unit \[°C\]. The modifier for the quantity are \[IsRange\]. This quantity is used to describe the entity \[**furnace**\]. The entity has the following property \[temperatures\].

Here, the highlighted token (red in the original) indicates a high-entropy token (entropy = 1.39), and the top-5 candidate tokens with probabilities are listed in the first column of Table [8](https://arxiv.org/html/2604.16929#A2.T8).

##### GRPO Output:

> We can find the quantity with surface form \[up to 798 °C\], it has unit \[°C\]. The modifier for the quantity are \[IsRange\]. This quantity is used to describe the entity \[Samples\]. The entity has the following property \[temperatures\].

There is no high-entropy token in the model’s output after the GRPO training. The top-5 candidate tokens with probabilities at the \[Samples\] position are shown in the second column of Table [8](https://arxiv.org/html/2604.16929#A2.T8).

##### Analysis:

GRPO successfully suppresses the high-entropy token observed in SFT, assigning the correct token "Samples" with high confidence and eliminating uncertainty, demonstrating improved reasoning stability and more deterministic output.

Table 8: Comparison of Top-5 candidate tokens at the target position between SFT and GRPO outputs.

## Appendix C MeasEval-Ext and its Annotation Details

The annotations are drawn from recent research papers that postdate the original MeasEval corpus. A distinct advantage of this data source is its adversarial selection strategy: unlike the randomized distribution in the original dataset, we deliberately curated 135 text segments (the same number as in the MeasEval evaluation dataset) containing novel units and complex quantity expressions absent from the training distribution. This design ensures that the dataset strictly tests the model’s ability to identify and ground quantities based on semantic context rather than memorized vocabulary.

To ensure high data quality, we enlisted researchers from our laboratory as annotators. The annotation process strictly followed the official MeasEval Annotation Guidelines. All samples were independently labeled by two annotators to capture the dense quantity-centric information. Following the initial annotation, results were reviewed and reconciled during an adjudication meeting to resolve disagreements and reach a final consensus.

The consistency of the dataset is validated by the Inter-Annotator Agreement (IAA). As shown in Table [9](https://arxiv.org/html/2604.16929#A3.T9), the Krippendorff’s Alpha scores (e.g., 0.921 for Quantity) indicate strong agreement, comparable to the original MeasEval benchmarks.
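To make the agreement statistic concrete, the sketch below computes Krippendorff's Alpha for the simplest case: two annotators, nominal labels, and no missing values. How units and labels were defined for MeasEval-Ext is not stated in this appendix, so the per-item labels and the function name `krippendorff_alpha_nominal` are illustrative assumptions, not the authors' procedure.

```python
from collections import Counter
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(a: Sequence[Hashable],
                               b: Sequence[Hashable]) -> float:
    """Krippendorff's alpha for two annotators, nominal data, no missing values."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")

    # Coincidence matrix: every unit contributes both ordered label pairs.
    coincidences: Counter = Counter()
    for x, y in zip(a, b):
        coincidences[(x, y)] += 1
        coincidences[(y, x)] += 1

    n = 2 * len(a)  # total number of pairable values
    marginals: Counter = Counter()
    for (x, _), count in coincidences.items():
        marginals[x] += count

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(c for (x, y), c in coincidences.items() if x != y) / n
    expected = sum(marginals[x] * marginals[y]
                   for x in marginals for y in marginals if x != y) / (n * (n - 1))

    if expected == 0:
        # Only one label was ever used; alpha is undefined, so this sketch
        # returns 1.0 by convention (an arbitrary choice).
        return 1.0
    return 1.0 - observed / expected


# Toy labels: the annotators disagree on the last of four items.
alpha = krippendorff_alpha_nominal([1, 1, 0, 0], [1, 1, 0, 1])
```

Alpha is 1 for perfect agreement and drops toward 0 as disagreement approaches what chance would produce given the label frequencies.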
Table 9: Inter-Annotator agreement (Krippendorff’s Alpha) for MeasEval-Ext

## Appendix D Quantity Phase Reward

The total reward $R(y)$ is a weighted sum of four components for mitigating distinct Quantity hallucination types, which includes the Format Reward ($r_{\text{fmt}}$), the Out-of-Scope Penalty ($r_{\text{scope}}$), the Fabrication Penalty ($r_{\text{fab}}$) and the Misclassification Penalty ($r_{\text{mis}}$):

$$R(y) = r_{\text{fmt}}(y) + r_{\text{scope}}(y) + r_{\text{fab}}(y) + r_{\text{mis}}(y) \tag{7}$$

1. $r_{\text{fmt}}$: To enforce output schema compliance, we validate the sequential semantic tags $\mathcal{S}_{\text{tags}} = \{\texttt{<ARABIC>}, \dots, \texttt{<CONCLUSION>}\}$ via regex pattern $\mathcal{P}_{\text{struct}}$. The binary reward is:

   $$r_{\text{fmt}}(y) = \mathbb{I}\Big(y \equiv \mathcal{P}_{\text{struct}}\Big) \tag{8}$$

2. $r_{\text{scope}}$: Constrains out-of-scope entities (e.g., “Fig. 1”) via local patterns $\mathcal{C}(e)$ and global precision $P_{\text{ans}}$:

   $$r_{\text{scope}} = -\lambda_{\text{loc}} \sum_{e} \mathcal{C}(e) + \beta_{\text{scope}} P_{\text{ans}} \tag{9}$$

3. $r_{\text{fab}}$: Prohibiting invalid quantity fabrication via parsers $\mathcal{T}_{\text{parse}}$ combining CQE (Almasian et al., [2023a](https://arxiv.org/html/2604.16929#bib.bib135)) and [Quantulum](https://github.com/nielstron/quantulum3), the penalty includes grounding constraints:

   $$r_{\text{fab}}(y) = -\lambda_{\text{fab}} \sum_{e \in \mathcal{E}_{y}} \mathbb{I}\Big(\mathcal{T}_{\text{parse}}(e) = \emptyset\Big) \tag{10}$$

4. $r_{\text{mis}}$: Mitigating span boundary errors via token-level precision $P_{\text{tok}}$, the reward is:

   $$r_{\text{mis}}(y) = \bar{F1}_{\text{tok}} - \lambda_{\text{mis}} \cdot (1 - P_{\text{tok}}) \tag{11}$$

## Appendix E Relation Phase Reward

While the sentence-based extraction excels at local entity identification (e.g., Units, Modifiers), it suffers from failures in capturing long-range dependency chains (e.g., MeasuredEntity, MeasuredProperty, Qualifier), inference bias, and under-extraction of sparse components. To address these issues, enforce logical completeness, suppress Quantity hallucinations, and incentivize sparse component retrieval, we design a composite reward function $R(y)$ optimized via GRPO. The total reward is a weighted sum of three dedicated components, the Format & Grounding Reward ($r_{\text{fmt}}$), the Relation Completeness & Exploration Reward ($r_{\text{comp}}$), and the Misclassification Penalty ($r_{\text{mis}}$), that target distinct quantitative extraction flaws:

$$R(y) = r_{\text{fmt}}(y) + r_{\text{comp}}(y) + r_{\text{mis}}(y) \tag{12}$$

The reward components are elaborated as follows with explicit optimization objectives and mathematical formulations:

1. $r_{\text{fmt}}$: To enforce structural consistency, we validate the existence of analysis sections $\mathcal{S}_{y}$ and adherence to the SFT schema $\mathcal{F}_{\text{SFT}}$. The binary reward is:

   $$r_{\text{fmt}}(y) = \mathbb{I}\Big(\mathcal{S}_{y} \neq \emptyset \land y \models \mathcal{F}_{\text{SFT}}\Big) \tag{13}$$

2. $r_{\text{comp}}$: Drives comprehensive exploration by aligning predicted groups $p$ with gold groups $g$. To prevent partial extraction, we incentivize full recovery via stepwise matching, closure bonuses, and weighted component bonuses:

   $$\begin{split} r_{\text{comp}}(y) &= \sum_{p \sim g} \Big( \underbrace{\lambda_{\text{step}} |p \cap g|}_{\text{Stepwise}} + \underbrace{\beta_{\text{full}} \mathbb{I}(g \subseteq p)}_{\text{Closure}} \Big) \\ &+ \underbrace{\lambda_{\text{exp}} \sum\nolimits_{c} w_{c} F1_{c}^{\text{ans}}}_{\text{Weighted Exploration}} \end{split} \tag{14}$$

   where weights $\mathbf{w}$ prioritize harder-to-predict dependencies to ensure no critical node is missed.

3. $r_{\text{mis}}$: Suppresses over-broad spans by penalizing token-level precision loss $(1 - P_{\text{tok}})$:

   $$r_{\text{mis}}(y) = F1_{\text{tok}} - (1 - P_{\text{tok}}) \tag{15}$$

Table 10: Taxonomy of Hallucinations in Information Extraction

## Appendix F Efficiency Analysis of Sentence-Based Reasoning

To analyze the efficiency of the proposed sentence-based reasoning strategy, we compare it with the rule-based method in terms of token consumption. On the MeasEval benchmark, the sentence-based approach requires an average of 871 tokens per response, which is a 27% reduction compared to the 1,193 tokens used by the rule-based method.
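The stated reduction follows directly from the two reported averages:

$$\frac{1193 - 871}{1193} = \frac{322}{1193} \approx 0.270$$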
This indicates that our method improves inference efficiency rather than introducing additional overhead. This efficiency gain can be attributed to the use of local context anchoring. While rule-based approaches often rely on constructing global reasoning chains across the entire input, our method focuses on extracting relevant elements from localized sentence-level contexts, where most measurement relations are concentrated. This reduces redundant reasoning paths and avoids unnecessary global search, leading to more efficient inference without compromising performance.

## Appendix G Generalization to Unseen Domains

To evaluate whether our method relies on dataset-specific priors or learns generalizable hallucination mitigation capabilities for extraction, we further assess MeasHalu on an unseen-domain benchmark, ChEMU-NER (He et al., [2021](https://arxiv.org/html/2604.16929#bib.bib127)), without additional fine-tuning. The ChEMU dataset consists of chemical patent texts annotated with entity roles such as reaction products, starting materials, and experimental conditions (e.g., time, temperature, and yield). These elements exhibit partial semantic alignment with the MeasEval schema, enabling evaluation via direct schema mapping. We compare two prompt strategies: (1) Task-Specific Optimized, where prompts are tailored to the chemical domain, and (2) MeasEval Style, where models are evaluated using our generalized extraction framework without domain-specific adaptation.

Table 11: Performance comparison under different prompt strategies.

As shown in Table [11](https://arxiv.org/html/2604.16929#A7.T11), MeasHalu achieves the best F1 score (0.7570) under the MeasEval-style setting, outperforming strong baselines such as DeepSeek-R1 and DeepSeek-V3. Notably, MeasHalu attains significantly higher precision (0.8465), indicating strong robustness in suppressing hallucinated predictions in unseen domains. Although recall is lower due to schema mismatch and incomplete mapping, the overall performance demonstrates that MeasHalu learns transferable hallucination mitigation capabilities for extraction rather than relying on dataset-specific priors.

## Appendix H Analysis of Sparse Component Extraction under Relaxed Matching

Extracting sparse elements such as qualifiers remains a challenging problem in scientific measurement extraction. This difficulty largely stems from the ambiguous semantic dependencies in scientific texts, where such elements often lack clear syntactic boundaries and exhibit high variability in expression. Under strict span-based evaluation (e.g., Strict Overlap F1), even minor boundary deviations can lead to substantial performance penalties, despite the model correctly identifying the core semantic region. For example, partial span mismatches (e.g., “with respect to earth” vs. “respect to earth”) are treated as complete errors.

To better reflect semantic correctness, we further evaluate model performance using a relaxed matching criterion following the MeasEval protocol, which allows partial span overlap. Under this metric, MeasHalu-7B achieves an F1 score of 0.315, indicating strong localization capability despite boundary ambiguity.

Table 12: Performance under relaxed matching for sparse components.

These results suggest that, although strict span-based metrics indicate low performance on sparse elements, the model is capable of accurately localizing the relevant semantic regions. Furthermore, the relatively high precision demonstrates effective suppression of hallucinations even under challenging extraction settings.

Table 13: Experimental results over the MeasEval Benchmark. Comparing MeasHalu with competition leaders and rule/sentence-based LLM baselines. Top ranks are shaded orange (1st), yellow (2nd), and teal (3rd).

Table 14: Experimental results over the MeasEval-Ext.

Table 15: Performance comparison on the OpenExp-Action-100 dataset. Models are provided with MeasEval-formatted quantities and relations generated by different sources (MeasHalu vs. Gemini) as context. The best scores for each model are highlighted in bold.

Figure 3: Comparison of sentence-based and rule-based reasoning approaches

## Appendix I Implementation Details

### I.1 Supervised Fine-Tuning (SFT)

For the Supervised Fine-Tuning (SFT) stage in MeasHalu, we adopt the LlamaFactory (Zheng et al., [2024](https://arxiv.org/html/2604.16929#bib.bib128)) framework for model training. When applying parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA), the training is conducted on a single NVIDIA A800 GPU. In contrast, full-parameter fine-tuning requires increased computational resources and is therefore performed using two NVIDIA A800 GPUs.

### I.2 GRPO Training

For the GRPO stage, we utilize the Unsloth framework (Daniel Han and team, [2023](https://arxiv.org/html/2604.16929#bib.bib129)) to improve training efficiency. The entire GRPO training process is carried out on a single NVIDIA A800 GPU.
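As a rough illustration of how the Appendix D reward terms might be scored for a single generation during the GRPO stage, the sketch below implements simplified stand-ins for the format reward (Eq. 8) and the misclassification term (Eq. 11) and sums them. Everything concrete here is an assumption for illustration: the regex standing in for $\mathcal{P}_{\text{struct}}$, whitespace tokenization, scoring a single span rather than averaging over spans, the default `lambda_mis=0.5`, and all function names. The scope and fabrication terms (Eqs. 9 and 10) are omitted because they depend on external parsers. This is not the authors' implementation and does not use the Unsloth API.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical stand-in for P_struct: an opening tagged section followed,
# somewhere later, by a CONCLUSION section that ends the output.
STRUCT_PATTERN = re.compile(
    r"^\s*<ARABIC-QUANTITY>.*?</ARABIC-QUANTITY>.*<CONCLUSION>.*?</CONCLUSION>\s*$",
    re.DOTALL,
)


def format_reward(output: str) -> float:
    """Binary format reward in the spirit of Eq. (8)."""
    return 1.0 if STRUCT_PATTERN.match(output) else 0.0


def token_prf(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 using multiset overlap."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def misclassification_reward(pred_span: str, gold_span: str,
                             lambda_mis: float = 0.5) -> float:
    """F1_tok - lambda_mis * (1 - P_tok), following the form of Eq. (11)."""
    precision, _, f1 = token_prf(pred_span.split(), gold_span.split())
    return f1 - lambda_mis * (1.0 - precision)


def partial_reward(output: str, pred_span: str, gold_span: str) -> float:
    """Sum of the two terms sketched here (r_scope and r_fab are left out)."""
    return format_reward(output) + misclassification_reward(pred_span, gold_span)


# Toy comparison: an exact span versus an over-broad span.
exact = misclassification_reward("up to 798 °C", "up to 798 °C")
over_broad = misclassification_reward("temperatures up to 798 °C", "up to 798 °C")
```

In this toy comparison the over-broad span keeps full recall but loses precision, so it scores below the exact match, which is the behaviour the precision penalty is meant to encourage.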