LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
Summary
This paper introduces LGMT, a framework that uses first-order logic to generate semantically invariant test cases for evaluating LLM reasoning reliability. Experiments on six LLMs show that LGMT exposes hidden defects missed by static benchmarks, suggesting evaluation should focus on robustness under logical invariance.
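The oracle-free idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three relations (premise reordering, symbol renaming, double negation of the conclusion) are standard logical equivalences chosen here for illustration, and every name in it (`follow_ups`, `violated_relations`, `brittle_model`) is hypothetical. A model is flagged whenever its answer changes across logically equivalent variants of the same problem, so no reference label is required.

```python
import re
from typing import Callable, Dict, List, Tuple

Case = Tuple[List[str], str]             # (premises, conclusion)
Model = Callable[[List[str], str], str]  # returns a label such as "True"


def reorder_premises(case: Case) -> Case:
    # Conjunction is commutative, so premise order must not matter.
    premises, conclusion = case
    return list(reversed(premises)), conclusion


def rename_symbols(case: Case, mapping: Dict[str, str]) -> Case:
    # Consistent renaming of symbols preserves validity.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")

    def sub(text: str) -> str:
        return pattern.sub(lambda m: mapping[m.group(1)], text)

    premises, conclusion = case
    return [sub(p) for p in premises], sub(conclusion)


def double_negate_conclusion(case: Case) -> Case:
    # A formula is equivalent to its double negation.
    premises, conclusion = case
    return premises, f"not (not ({conclusion}))"


def follow_ups(case: Case) -> Dict[str, Case]:
    # Each follow-up case should receive the same answer as the source case.
    return {
        "premise_reordering": reorder_premises(case),
        "symbol_renaming": rename_symbols(case, {"P": "A", "Q": "B"}),
        "double_negation": double_negate_conclusion(case),
    }


def violated_relations(model: Model, case: Case) -> List[str]:
    # Cross-case consistency check: compare answers, never a ground truth.
    source_answer = model(*case)
    return [
        name
        for name, variant in follow_ups(case).items()
        if model(*variant) != source_answer
    ]


def brittle_model(premises: List[str], conclusion: str) -> str:
    # Toy stand-in for an LLM call (assumption): it keys on the surface
    # form of the conclusion instead of reasoning over the premises.
    return "True" if conclusion.strip() == "Q" else "Unknown"


source = (["P -> Q", "P"], "Q")
print(violated_relations(brittle_model, source))
# prints ['symbol_renaming', 'double_negation']
```

In practice `brittle_model` would be replaced by a function that prompts a real LLM and parses its label; the consistency check itself stays unchanged.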
Source: [https://arxiv.org/abs/2605.23965](https://arxiv.org/abs/2605.23965)
[View PDF](https://arxiv.org/pdf/2605.23965)

> Abstract: Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.

## Submission history
From: Zenghui Zhou
**[v1]** Tue, 12 May 2026 18:26:59 UTC (2,359 KB)
Similar Articles
Logic-Regularized Verifier Elicits Reasoning from LLMs
Introduces LoVer, an unsupervised verifier that uses logical rules (negation consistency, intra-group and inter-group consistency) to improve LLM reasoning without labeled data, achieving performance close to supervised verifiers on reasoning benchmarks.
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. The benchmark reveals significant gaps in current models, with the best reaching only 37.5% accuracy on hard items.
Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics
This paper introduces a methodology to enrich scientific logicality in LLM reasoning, including assessment criteria and data sampling methods, and demonstrates its effectiveness on physics problems using multiple backbone LLMs.
Geometric Latent Reasoning Induces Shorter Generations in LLMs
Geometric Latent Reasoning (GLR) introduces a geometric path-approximation method for latent reasoning in LLMs, enabling shorter generations while maintaining accuracy across mathematical reasoning benchmarks.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
This empirical study evaluates LLMs on the Equivalence Class Problem to assess long-chain reasoning capabilities, finding that non-reasoning models fail while reasoning models struggle with specific structural difficulties.