Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

arXiv cs.CL Papers

Summary

This paper presents a systematic human audit of NL-to-FOL datasets FOLIO and MALLS, finding 39% and 36% incorrect formalizations respectively. It releases corrected ground truths and an LLM-assisted framework to focus human relabeling, reducing the review workload to under 24% of instances for 90% accuracy.

arXiv:2606.02837v1 Announce Type: new Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of \textsf{FOLIO} and a subset of \textsf{MALLS} test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in \textsf{FOLIO} (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma~4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manual reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
Original Article
View Cached Full Text

Cached at: 06/03/26, 09:35 AM

# Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Source: [https://arxiv.org/html/2606.02837](https://arxiv.org/html/2606.02837)
Michele Mignani1∗Angelo Montanari1Nicola Saccomanno1 1University of Udine, Italy name\.surname@uniud\.it ∗Corresponding author

###### Abstract

Accurate translation from Natural Language to First\-Order Logic \(NL\-to\-FOL\) underpins neurosymbolic AI systems and Natural Language Inference \(NLI\), making the quality of NL\-to\-FOL benchmarks essential—yet these datasets have never been rigorously audited\. Our first contribution is to present a systematic human inspection of the validation split ofFOLIOand a subset ofMALLStest instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations \(i\.e\., ground truth labels\), with additional rates of ambiguous NL sentences \(16\.4% and 48%\) and incorrect NLI labels inFOLIO\(8\.4%\)\. Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state\-of\-the\-art LLMs \(Gemma 4 31B\-it, Qwen3\-30B\-A3B, and GPT\-4o\-mini\) with the corrected ground truths yields accuracy gains from \+9 to \+22 percentage points\. Motivated by these findings, we propose an LLM\-based framework to support humans in manual reviewing NL\-to\-FOL datasets\. By directing reviewers toward the most error\-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review\. We release all human\-verified annotations and the code for our framework\.

Fixing FOLIO and MALLS: Verified Annotations and an LLM\-assisted Framework to Focus Human Relabeling

Andrea Brunello1Cristian Curaba1Luca Geatti1

Michele Mignani1∗Angelo Montanari1Nicola Saccomanno11University of Udine, Italyname\.surname@uniud\.it∗Corresponding author\.

## 1Introduction

Automatically translating Natural Language \(NL\) into a machine\-readable formalism—often calledautoformalization—is a fundamental building block of neurosymbolic AI, with applications ranging from Natural Language Inference \(NLI\)\(DBLP:conf/nips/YeCDD23;DBLP:conf/emnlp/OlaussonGLZSTL23;DBLP:conf/emnlp/PanAWW23\)to runtime verification and AI safety\(Toward\_guaranteed\_safe\_AI\)\. Among the target formalisms, First\-Order Logic \(FOL\) stands out for its expressiveness and computational tractability\. Yet, NL\-to\-FOL translation111See Appendix[A](https://arxiv.org/html/2606.02837#A1)for a more formal definition\.remains a longstanding challenge for both humans and automated systems\(barker2009difficulty;singh2020exploring\)\.

To support progress in this area, datasets have been developed to facilitate training of NL\-to\-FOL translation systems and enable their systematic comparison\. In practice, however, only a limited number of partially human\-curated resources are publicly available, most notably FOLIO\(DBLP:conf/emnlp/HanS0QRZCPQBSWS24\)and MALLS\(DBLP:conf/acl/YangXPSF24\)\. Prior work has examined limitations in evaluation metrics and task design\(brunello2026llms\), but the reference annotations themselves have received little systematic attention\.

In this paper, we show that these annotations contain significant errors\. We conduct a systematic human audit of theFOLIOvalidation split and a subset ofMALLStest instances, finding that 39% and 36% of examples, respectively, contain incorrect FOL formalizations\. Additionally, 16\.4% ofFOLIOand 48% ofMALLSinclude NL statements that admit multiple, non\-equivalent yet defensible interpretations \(Section[2](https://arxiv.org/html/2606.02837#S2)\)\. These annotation errors have a direct and notable impact on evaluation: re\-assessing state\-of\-the\-art models \(Gemma 4 31B\-it, Qwen3\-30B\-A3B, GPT\-4o\-mini\) against our corrected ground truth yields accuracy gains of \+9 to \+22 percentage points \(Section[2\.3](https://arxiv.org/html/2606.02837#S2.SS3)\)\.

Exhaustive manual review, however, does not scale to large datasets or real\-world deployments\. We therefore introduce an LLM\-assisted oversight framework that focuses human effort on the instances most likely to be incorrect\. The approach exploits a key empirical finding: LLMs rarely flag a correct formalization as wrong—in theFOLIOvalidation split, Gemma marks only∼3%\{\\sim\}3\\%of correct instances as incorrect \(see Appendix[I](https://arxiv.org/html/2606.02837#A9)\)\. Directing annotators to instances that the model deems incorrect, our framework enables 90% annotation accuracy while requiring review of only∼24%\{\\sim\}24\\%ofFOLIOvalidation instances and∼8%\{\\sim\}8\\%of theMALLStest subset\. Moreover, when applied to the high\-quality private datasetGGC\(barker2011student\), it introduces negligible additional noise, confirming its reliability when errors are rare\.

Our framework finds natural application across several settings where high\-quality FOL corpora are required but exhaustive human verification is expensive\. First, in*dataset curation*, rather than subjecting every instance to costly expert review, curators can focus exclusively on the small fraction flagged as suspicious, turning the process into a much more tractable task\. Second, in*formal methods*and verification, the framework can assist experts by identifying likely incorrect formalizations of NL specifications, acting as a warning layer\.

Finally, in*runtime verification*scenarios\(DBLP:journals/corr/abs\-2504\-21022\), where newly defined behavioral constraints must be checked immediately against live data, there is no time for manual review; the framework provides on\-the\-fly feedback on whether a generated formula faithfully captures the intended requirement\.

#### Contributions\.

1. 1\.Corrected datasets:revised FOL formalizations and NLI labels \(when applicable\) for theFOLIOvalidation split \(275 instances\) and for aMALLStest subset \(100 instances\) \(Section[2](https://arxiv.org/html/2606.02837#S2)\)\.
2. 2\.Evaluation impact:accuracy gains \(from \+9 to \+22 percentage points\) when state\-of\-the\-art LLMs are assessed against the corrected ground truth \(Section[2\.3](https://arxiv.org/html/2606.02837#S2.SS3)\)\.
3. 3\.LLM\-assisted oversight framework:a novel strategy that directs human annotation effort toward instances most likely to contain errors, substantially reducing review cost without sacrificing quality of the revision \(Section[3](https://arxiv.org/html/2606.02837#S3)\)\.

To support reproducibility and foster future research, we publicly release the corrected dataset at[https://huggingface\.co/DSAVlab\-UNIUD](https://huggingface.co/DSAVlab-UNIUD); code is not released at this time\.

## 2Datasets and Their Quality Analysis

We analyze errors and ambiguities in theFOLIOvalidation split\(DBLP:conf/emnlp/HanS0QRZCPQBSWS24\)and in a subset of theMALLStest set\(DBLP:conf/acl/YangXPSF24\), and describe our methodology for human verification and correction\. As shown in Table[1](https://arxiv.org/html/2606.02837#S2.T1), approximately 39% ofFOLIOand 36% ofMALLSinstances contain annotation errors, with direct consequences for model evaluation \(Section[2\.3](https://arxiv.org/html/2606.02837#S2.SS3)\)\.

Table 1:Errors and ambiguities identified during human curation\.Incorrect FOL sentencescounts formalizations that are wrong in the original datasets; subcategories are not mutually exclusive\.### 2\.1Datasets

FOLIOis an NLI\-through\-FOL benchmark: each instance contains a multi\-sentence story \(premises\) and a conclusion, both in NL and FOL, paired with an NLI label \(True/False/Unknown\) indicating whether the conclusion is entailed by, contradicted by, or independent of the premises\. The dataset is split randomly into training, validation, and test sets \(∼\{\\sim\}1360/275/275 instances\)\.222Errors found in the validation split may hence apply to the other splits as well\.FOLIO is the dominant benchmark for NL\-to\-FOL translation\(MALLS;brunello2026llms\)and neurosymbolic reasoning\(DBLP:conf/emnlp/OlaussonGLZSTL23;DBLP:conf/iclr/RyuKLY25\)\. We verify its validation split \(FOLIO\_validation\)\.

MALLSis a large\-scale autoformalization dataset of NL–FOL pairs synthetically generated by GPT\-4, commonly used for training autoformalizers\(journals/corr/abs\-2509\-22338;DBLP:journals/corr/abs\-2409\-16461\)\. The full dataset comprises∼\{\\sim\}28K instances; 1000 were declared human\-checked and reserved as a test set\. We consider the first 100 of these \(MALLS\_test\)\.333The MALLS test set provides no metadata about the ordering of its instances, and the original paper does not document any explicit ordering criterion\. In the absence of such information, no sampling strategy is demonstrably more representative than any other; we take the first 100 instances, considering it a transparent and reproducible choice\.

We also include 213 instances fromGGC\(barker2011student\)as a control, to verify that our curation framework does not introduce noise into already\-correct annotations\. GGC contains students’ FOL submissions for*Tarski’s World*exercises from*Language, Proof and Logic*\(LPL\), validated by instructors via an automatic tool, providing strong guarantees of syntactic and semantic correctness\.444GGC is not publicly available but is shared by the authors upon request\.

### 2\.2Quality Analysis

Prior works\(DBLP:conf/emnlp/OlaussonGLZSTL23;brunello2026llms\)have noted the presence of errors in FOLIO and MALLS; this paper aims to quantify their impact on evaluation outcomes\.

In both datasets we identified bothtranslation errorsambiguous NL sentences\. Both phenomena may penalize a model that produces correct formalizations: translation errors cause a valid formula to be scored against an incorrect reference, while ambiguous sentences allow multiple legitimate interpretations, only one of which is encoded in the ground truth\. GGC showed no significant errors, with only a small portion of ambiguous sentences \(∼8\.5%\{\\sim\}8\.5\\%\)\.

The originalFOLIO\_validationalso containsNLI labels inconsistentwith the provided formalizations; we corrected 17 such labels \(∼8\.4%\{\\sim\}8\.4\\%of the dataset\) and re\-verified them against NL reasoning\.

In the following, we detail the annotation protocol, the issues identified during curation and the choices made to address them\.

#### Annotation Protocol\.

The curation was conducted by two experienced reviewers: a PhD student and a research fellow, both with a background in mathematics or computer science\. For each dataset, the instances were split equally between the two annotators, each of whom first reviewed their own half independently and then cross\-checked the half reviewed by the other\. Conflicts were resolved by discussion until consensus was reached\. No LLM output was consulted during the review process, with the sole exception of automatically mapping predicate and constant symbols to natural language glosses to aid interpretation; these mappings were subsequently verified manually by the annotators\.

#### Ontology Issues\.

Since no dataset provides an explicit ontology, we extract the signature—predicates, arities, and constants—automatically from the FOL formula and then we use an LLM to assign natural language meanings to each symbol, verifying then their consistency manually\.

We encountered two distinct types of ontological issues\. The first is*inconsistencies*: when a symbol appeared with conflicting arities or meanings across the premises and conclusion of the same story, or within the same formula, we manually corrected the ontology to restore consistency\. InMALLS\_test, we frequently found compound relational symbols \(e\.g\.,WorksInNewsIndustryAndReportsOnEvents\) that obscure the logical structure\. In such cases, we decomposed the predicate into its constituent parts \(e\.g\.,WorksInNewsIndustryandReportsOnEvents\) to recover a compositional ontology\.

The second type is*ontological looseness*: unconventional but internally consistent logical conventions applied uniformly throughout a formalization\. A concrete example is the following: the NL expressionschool eventsis encoded as the constantschoolEvent, even though FOL constants canonically denote specific individuals rather than classes\. A more standard rendering would use a predicateSchoolEvent\(x\)ranging over individuals\. This pattern recurs across several FOLIO stories\.

Unlike the first type, cases of ontological looseness are deliberately*preserved*in the curated release since they do not compromise the validity of the reasoning\. Correcting them would furthermore require rewriting every dependent formula, producing a new dataset rather than a curated version of the original; and since ontology design admits genuine degrees of freedom, any such revisions would substitute one legitimate modelling choice for another, with no principled basis for preferring a finer or a coarser granularity in the absence of additional contextual information\.

#### Translation Errors\.

We identified recurring errors that compromise the correctness of the ground truth\.Syntactic errors\(parenthesis mismatches, symbol typos, ontology misuse, and free variables\) affect 10\.5% of instances in FOLIO and 3% in MALLS\.Semantic errors, where a syntactically valid formula misrepresents the NL meaning, are more prevalent \(∼\\sim28\.3% in FOLIO and∼\\sim33% in MALLS\) and harder to detect automatically\. They include quantifier scope errors, missing information explicitly stated in the NL sentence, incorrect entity relativization, and broader logical structure failures\.

A summary of these findings is reported in Table[1](https://arxiv.org/html/2606.02837#S2.T1)\. Appendix[B](https://arxiv.org/html/2606.02837#A2)discusses how to reconcile these findings with seemingly contradictory LLM logical translation performance rates reported in the literature\. The curated dataset releases include corrected ontologies \(see Appendix[C](https://arxiv.org/html/2606.02837#A3)\), and corrected formalizations together with explanations of each correction \(see Appendix[D](https://arxiv.org/html/2606.02837#A4),[E](https://arxiv.org/html/2606.02837#A5), and[F](https://arxiv.org/html/2606.02837#A6)for details\)\.

#### Ambiguities\.

The inherent ambiguity of NL sentences poses a fundamental challenge for NL\-to\-FOL formalization\(DBLP:journals/aiopen/YadavPS21\): when acknowledged, it is typically addressed by requiring that sentences explicitly state all background information necessary for their interpretation, or that vague terms be avoided\. Since no prior work provides systematic criteria for identifying ambiguity in NL\-to\-FOL benchmarks, we adopt an operative definition and annotate ambiguous instances explicitly during the human curation phase\. We regard a sentence as ambiguous when, given a fixed ontology, multiple non\-equivalent formalizations are defensible\.

Ambiguity has direct consequences on evaluation: because all NL\-to\-FOL benchmarks supply a single reference formula, equivalence\-based scoring penalizes a model when it proposes correct alternative formalizations\.

Explicit ambiguity annotation thus serves a dual purpose: it enables downstream analyses to account for this confound, and it fosters a cleaner assessment of model capabilities\. For a qualitative classification of the ambiguity types encountered across datasets, see Appendix[G](https://arxiv.org/html/2606.02837#A7)\.

#### FOLIO’s NLI Labels Inconsistencies\.

As already discussed, FOLIO is also an NLI dataset\. For each set of premises\{p1,…,pn\}\\\{p\_\{1\},\\dots,p\_\{n\}\\\}and conclusioncc, there is also a labell∈\{l\\in\\\{True,False,Unknown\}\\\}that reflects ifccis entailed byp1,…,pnp\_\{1\},\\dots,p\_\{n\}\(True\), is in contradiction with them \(False\) or its truth value can’t be deductively established \(Unknown\)\.

FOLIO provides formalizationsφ1,…,φn\\varphi\_\{1\},\\dots,\\varphi\_\{n\}of the story premises andψ\\psiof the conclusion\. These can be used to check whether the labelllassigned to a pair premises\-conclusion is correct: indeed, the entailment should be preserved in FOL, so the value ofllshould depend on whether fromφ1,…,φn\\varphi\_\{1\},\\dots,\\varphi\_\{n\}, one could deriveψ\\psi, or its negation, or neither of them\.

Using the original FOL sentences, the label provided by the author is different from the one derived from the Z3 automatic solver\(de2008z3\)18% of times: this is largely motivated by the great amount of errors present in the FOL formulas, and partially \(2%\) by the fact that the inference works only assuming some background knowledge that is not explicitated in the NL sentence and it is not known if it can be assumed \(hence formalized\) or not\.

Our curation \(i\) corrects the NLI labels \(8\.4% of the entire dataset\), \(ii\) makes implicit background knowledge explicit when necessary for the entailment, \(iii\) re\-verifies the consistence between the new labels and the curated formalizations with the Z3 solver\.

### 2\.3LLM Re\-evaluation and Performance Shifts

Table 2:Translation accuracy \(%\) onFOLIO\_validationandMALLS\_testunder*Original*\(Orig\.\) and*Curated*\(Cur\.\) ground truth for all instances and the unambiguous subset\.Δ\\Delta= \(Curated – Original\); all values are positive, confirming that curation consistently raises measured accuracy\.We re\-evaluate a selection of recent LLMs—Gemma 4 31B\-itgemmateam2025gemma3technicalreport, Qwen3\-30B\-A3Byang2025qwen3technicalreport, and GPT\-4o\-miniopenai2024gpt4ocard—on the NL\-to\-FOL translation task using Chain\-of\-Thought prompting with the target ontology provided in context\.

Models are evaluated on both the original and curated datasets; to ensure a fair comparison, the ontology supplied at inference matches the one used in each dataset version\. It is checked then, via Z3, whether the formula produced by the model is logically equivalent to the original or curated formalization \(see Appendix[H](https://arxiv.org/html/2606.02837#A8)for details\)\. As shown in Table[2](https://arxiv.org/html/2606.02837#S2.T2), curation consistently raises measured accuracy across all models, with gains of \+9 to \+22 percentage points onFOLIOand \+9 to \+18 onMALLS\. The largest improvements are exhibited by Gemma 4, the strongest model, suggesting that annotation noise disproportionately penalizes models capable of producing valid but non\-reference formalizations\. We also report results restricted to unambiguous instances, providing a cleaner assessment of model capability that is not confounded by inherently ambiguous NL sentences\.

## 3LLM\-assisted Human Oversight

Manual inspection is the most reliable way to ensure ground\-truth quality, but does not scale: costs \(money and time\) grow linearly with dataset size, making exhaustive review impractical for large corpora\. Delegating verification to automation is equally unsatisfactory, since assessing the alignment between a NL sentence and its formal representation remains an open problemDBLP:conf/aaai/MensfeltCFKTS26\. We therefore investigate a middle ground: using LLMs as judgesDBLP:conf/nips/ZhengC00WZL0LXZ23to build a first\-pass filter that concentrates human attention on instances most likely to contain errors\. To this end, in this section we describe the*Verdict\-and\-Refinement Task*, two pipeline designs that instantiate it, and the resulting prioritization strategy for LLM\-assisted dataset curation\.

### 3\.1Verdict\-and\-Refinement Task

We define the*Verdict\-and\-Refinement*task \(V&R\) as follows: given a NL sentencepp, a candidate FOL formulaφ\\varphi, and an ontologyΩ\\Omega\(predicates, constants, and their associated meanings\), the goal is to produce \(i\) a*verdict*v∈\{yes,no,?\}v\\in\\\{\\textsf\{yes\},\\textsf\{no\},\\textsf\{?\}\\\}and \(ii\) a*refinement*FOL formulaψ\\psi\. Specifically:

- •v=yesv=\\textsf\{yes\}:φ\\varphicorrectly formalizesppin the given ontologyΩ\\Omega; no refinement is needed;
- •v=nov=\\textsf\{no\}:φ\\varphiincorrectly formalizesppunderΩ\\Omega; the refinementψ\\psishould provide a correct formalization ofpp;
- •v=?v=\\textsf\{?\}: the model is uncertain, orppis ambiguous, i\.e\., it admits multiple non\-equivalent yet defensible formalizations underΩ\\Omega: in the latter case,ψ\\psishould capture the most natural interpretation ofpp\.

The three\-way verdict reflects a deliberate design choice, driven primarily by the treatment of ambiguity\. A binaryyes/nodecision would force the model to either reject legitimate alternative interpretations or arbitrarily privilege one reading as canonical; the?label instead explicitly accommodates ambiguity while also allowing abstention under uncertainty\.

Pipeline 1Initial Dataset\(p,Ω,φ\)\(p,\\;\\Omega,\\;\\varphi\)V&RLLMv,ψv,\\;\\psiψ^=\{φifv=yesψotherwise\\hat\{\\psi\}=\\begin\{cases\}\\varphi&\\text\{if $v$ = \{yes\}\}\\\\ \\psi&\\text\{otherwise\}\\end\{cases\}Proposed Formula SelectionOutput\(p,Ω,ψ^\)\(p,\\;\\Omega,\\;\\hat\{\\psi\}\),vvPipeline 2Initial Dataset\(p,Ω\)\(p,\\;\\Omega\)LogicalTranslationLLMφ^\\hat\{\\varphi\}V&RLLMv,ψv,\\;\\psiψ^=\{φ^ifv=yesψotherwise\\hat\{\\psi\}=\\begin\{cases\}\\hat\{\\varphi\}&\\text\{if $v$ = \{yes\}\}\\\\ \\psi&\\text\{otherwise\}\\end\{cases\}Proposed Formula SelectionOutput\(p,Ω,ψ^\)\(p,\\;\\Omega,\\;\\hat\{\\psi\}\),vv

Figure 1:The two pipelines\. Each starts from theInitial Datasetcontaining triplets\(p,Ω,φ\)\(p,\\Omega,\\varphi\), and produces an output with theFormalization Proposal\(p,Ω,ψ^\)\(p,\\Omega,\\hat\{\\psi\}\)and theverdictvv\.Pipeline 1judges the original formulaφ\\varphidirectly\.Pipeline 2first re\-generates a candidateφ^\\hat\{\\varphi\}fromppandΩ\\Omegaalone, then judges it\.
### 3\.2Pipelines

Considering a dataset whose instances are triplets\(p,Ω,φ\)\(p,\\Omega,\\varphi\)as in Section[3\.1](https://arxiv.org/html/2606.02837#S3.SS1), we embed the V&R task within two distinct pipelines, illustrated in Figure[1](https://arxiv.org/html/2606.02837#S3.F1)\.

- •Pipeline 1: Direct V&R\.The LLM receives the triplet\(p,Ω,φ\)\(p,\\Omega,\\varphi\)and produces a verdictvvand, optionally, a refinementψ\\psiwhenφ\\varphiis judged incorrect or ambiguous\. The ultimately proposed formula isψ^=φ\\hat\{\\psi\}=\\varphiifv=yesv=\\textsf\{yes\}, andψ^=ψ\\hat\{\\psi\}=\\psiotherwise\.
- •Pipeline 2: Re\-generation and V&R\.The original formulaφ\\varphiis discarded\. The LLM first translatesppunderΩ\\Omegainto a candidateφ^\\hat\{\\varphi\}, which is then submitted to the same V&R step as in Pipeline 1\. The finally proposed formula isψ^=φ^\\hat\{\\psi\}=\\hat\{\\varphi\}ifv=yesv=\\textsf\{yes\}, andψ^=ψ\\hat\{\\psi\}=\\psiotherwise\.

The two pipelines embody different assumptions about the original formulaφ\\varphi\. Pipeline 1 treats it as a reliable anchor: refining an existing formula may be more effective than generating one from scratch\. Pipeline 2 treats it as a potential liability: whenφ\\varphiis unreliable or difficult for the model to evaluate, retaining it may mislead the V&R step, and a freshly generated candidate can be a safer starting point\.

Each pipeline processes the original dataset and produces an output comprising two elements for each instance: \(i\) theformalization proposalconsisting of triplets\(p,Ω,ψ^\)\(p,\\Omega,\\hat\{\\psi\}\), whereψ^\\hat\{\\psi\}is the pipeline’s proposed formula, and \(ii\) theverdictvvemitted during the V&R step, referring to the original formulaφ\\varphior to the LLM\-generated candidateφ^\\hat\{\\varphi\}, respectively in Pipeline 1 and Pipeline 2\. Despite this difference in reference, in both cases the verdict carries information about the reliability ofψ^\\hat\{\\psi\}: ayesverdict signals thatψ^\\hat\{\\psi\}derives from a formula that was deemed correct;noindicates it refines a formula deemed incorrect; and?flags either ambiguity inppor model uncertainty\.

### 3\.3Prioritization Order

Assume a pipeline is run on all instances of a dataset, producing outputs\(p,Ω,ψ^\)\(p,\\Omega,\\hat\{\\psi\}\), each paired with a verdictvv\. The verdictsvvcan be used to prioritize manual inspection, by inspecting first the instances in a certain verdict class, followed by others from another class, and so on: we call this*prioritization order*\. Intuitively, for example, ayesverdict means formulaψ^\\hat\{\\psi\}was considered correct by both the generator \(Pipeline 1: dataset creator; Pipeline 2: LLM\) and the V&R judge, making it the least likely to contain errors, and hence the last to review\. In Section[5](https://arxiv.org/html/2606.02837#S5), we show how to find an adequate prioritization order\.

## 4Experimental Setup

To evaluate our proposed LLM\-assisted oversight approach, we require ground\-truth annotations for each instance\. We therefore assume every original triplet\(p,Ω,φ\)\(p,\\Omega,\\varphi\)is accompanied by \(i\) a gold verdictvgold∈\{yes,no,?\}v\_\{\\text\{gold\}\}\\in\\\{\\textsf\{yes\},\\textsf\{no\},\\textsf\{?\}\\\}assessing the correctness ofφ\\varphior ambiguity ofpp, and \(ii\) a gold formalizationφgold\\varphi\_\{\\text\{gold\}\}ofppunderΩ\\Omega\.

Specifically, for each instance\(p,φ\)\(p,\\varphi\)in the originalFOLIO\_validationandMALLS\_test, we derive the ontologyΩ\\Omegafromφ\\varphiand construct the curated ground\-truthφgold\\varphi\_\{\\text\{gold\}\}; given this, we derive the verdictvgoldv\_\{\\text\{gold\}\}marking asyes,no, and?the instances for whichφ≡φgold\\varphi\\equiv\\varphi\_\{\\text\{gold\}\}, for whichφ≢φgold\\varphi\\not\\equiv\\varphi\_\{\\text\{gold\}\}, and for whichppis ambiguous\. This is exactly the process used in Section[3](https://arxiv.org/html/2606.02837#S3)for curating the datasets\. We also perform our evaluation onGGCto measure the possible degradation introduced by the pipelines when applied to already\-correct data\. Since GGC annotations are verified correct,vgold=yesv\_\{\\text\{gold\}\}=\\textsf\{yes\}andφgold=φ\\varphi\_\{\\text\{gold\}\}=\\varphifor all instances, except for the few ambiguous sentences \(wherevgold=?v\_\{\\text\{gold\}\}=\\textsf\{?\}\)\.

#### Simulating human review\.

Our primary interest is in measuring how effectively each proposed pipeline, when coupled with a prioritization order, concentrates errors at the top of the inspection queue, so that a human annotator achieves high accuracy while reviewing as few instances as possible\. We model this by simulating a perfect annotator who, upon inspecting an instance, immediately corrects it\. We track*accuracy*—the fraction of correct instances in the dataset—as a function of the fraction of instances reviewed by the human,h∈\[0,1\]h\\in\[0,1\]; we refer to this as the*accuracy–human effort curve*\.

Two baselines provide reference curves against which to compare the prioritization effectiveness of the pipelines: \(i\)Black Baseline \(original dataset\)—the annotator reviews instances from the original dataset in random order\. Accuracy starts at the original dataset accuracy and increases linearly to 100% ashhreaches 1; \(ii\)Green Baseline \(LLM regeneration\)—the annotator starts from a synthetic dataset produced in the Logical Translation phase of Pipeline 2 and reviews instances in random order\. Accuracy starts at the regenerated dataset accuracy and increases linearly to 100% ashhreaches 1\.

#### Metrics\.

As we mentioned, each strategy corresponds to an accuracy–human effort curve, with steeper early rises indicating more effective prioritization\. Our primary metric is the*Area Under the Curve*\(AUC; higher is better\)\. Since AUC alone is not easily interpretable, we complement it withTαT\_\{\\alpha\}, the minimum fraction of instances to review to reachα%\\alpha\\%accuracy \(lower is better\)\. We reportT90T\_\{90\}andT95T\_\{95\}\.

#### Models and prompting\.

We run all pipelines with three models: GPT\-4o\-mini, Qwen3\-30B\-A3B, and Gemma 4 31B\-it\. For each model, we evaluate four prompting strategies varying in reasoning approach \(standard CoT \(B1\), few\-shot \(B2\), few\-shot\+CoT \(B3\), and meta\-prompting \(B4\)\) and four prompt variants \(pv1\-pv4\) that differ in their degree of skepticism toward the input formula and level of instructional detail\. Full details are in Appendix[H](https://arxiv.org/html/2606.02837#A8)\.

## 5Results and Discussions

![Refer to caption](https://arxiv.org/html/2606.02837v1/x1.png)Figure 2:Pipelines comparison across models \(horizontally\) and datasets \(vertically\) according to the AUC \(Area Under the Curve\) metric\. Asterisks denote statistical significance \(\*\*\* =p<0\.001p<0\.001, \*\* =p<0\.01p<0\.01,p<0\.05p<0\.05\) under the one\-sided paired Wilcoxon signed\-rank test\.ns: non\-significant\.![Refer to caption](https://arxiv.org/html/2606.02837v1/x2.png)Figure 3:*Accuracy\-human effort*curve\. Each plot shows Gemma’s performance onFOLIO\_validation\(left\) andMALLS\_test\(right\)\. The blue band represents Pipeline 1 \(min/max and average across the six prompting combinations \{B1, B2, B3\}×\\times\{pv1, pv3\}\); the red curve shows the best\-AUC configuration of Pipeline 2; the black and green line represents respectively the Black and Green Baseline\.For each of the two pipelines, in Appendix[I](https://arxiv.org/html/2606.02837#A9), we measure accuracy within each verdict class and find that it increases monotonically from?tonotoyes\. We thus adopt the following prioritization order for both of them: \(i\) malformed outputs, \(ii\)?, \(iii\)no, \(iv\)yes\. All results below are reported under this prioritization\.

Figure[2](https://arxiv.org/html/2606.02837#S5.F2)compares the prioritization strategies across datasets and models\. In terms of AUC, Pipeline 1 \(blue\) outperforms Pipeline 2 \(red\), which in turn outperforms the Green Baseline \(green\)\. This ordering holds across all metrics \(T90T\_\{90\},T95T\_\{95\}\) and is statistically significant \(Appendix[J](https://arxiv.org/html/2606.02837#A10)\)\.555For simplicity, these figures omit the Black Baseline, which is based on the original dataset; its \(bad\) performance is addressed below\.Among the models evaluated, Gemma performs best \(Appendix[K](https://arxiv.org/html/2606.02837#A11)\)\.

Given the witnessed results, we focus on Gemma with Pipeline 1 for the main analysis\. Among the prompting strategies, B4 and variants pv2 and pv4 perform significantly worse than the remaining configurations; the others \(B1, B2, B3 with pv1 and pv3\) are statistically indistinguishable, indicating that Pipeline 1 is robust to prompting choices within this set \(Appendix[L](https://arxiv.org/html/2606.02837#A12)\)\. See Appendix[M](https://arxiv.org/html/2606.02837#A13)for the results for all models and datasets\.

Figure[3](https://arxiv.org/html/2606.02837#S5.F3)plots the*accuracy–human effort curves*, showing the full range of Pipeline 1 performance across the retained prompting configurations \(blue band\) and the best\-AUC configuration for Pipeline 2\. The ordering among Green Baseline, Pipeline 2, and Pipeline 1 is preserved across all datasets and metrics, with the sole exception of MALLST90T\_\{90\}, where Pipeline 2 performs slightly better than Pipeline 1 \(0\.13 vs 0\.12\)\.

The steep initial slopes confirm that the above prioritization order effectively front\-loads error\-prone instances\. The practical gains are substantial: onFOLIO, 90% accuracy is reached after reviewing only 24% of instances—less than half the effort required by the Green Baseline \(63%\) and a third of that required by the Black Baseline \(74%\)\. OnMALLS, the reduction is even more pronounced: 13% suffices against 38% for the Green Baseline \(a3×3\\timesreduction\) and against 72% for the Black Baseline \(a5×5\\timesreduction\)\. Comparable gains hold forT95T\_\{95\}\.

Furthermore, in both datasets, instances assigned ayesverdict constitute around half of the data; accuracy restricted to such a class is above 97%\. This opens the possibility of treating such a class as a ready\-to\-use curated training set; however, doing so halves the dataset and may disproportionately discard the most challenging instances—a trade\-off that practitioners should weigh carefully\.

Finally, when applied to the already 100% correct dataset GGC, all three models with Pipeline 1 produces a limited degradation≤5%\\leq 5\\%\(Appendix[M](https://arxiv.org/html/2606.02837#A13)\)\.

## 6Conclusions and Future Works

This work addresses systematic annotation errors in two widely used NL\-to\-FOL datasets,FOLIO\_validationandMALLS\_test, uncovering error rates of 39% and 36% respectively\. These errors measurably distort LLM evaluation, with performance estimates shifting by up to 22 points once corrected annotations are used\. To address the infeasibility of exhaustive human oversight, we propose an LLM\-assisted framework that concentrates human effort on the instances most likely to contain errors\. The best pipeline configurations allow a human annotator to reach 90% accuracy after reviewing only 24% ofFOLIO\_validationand 13% ofMALLS\_test\.

Future investigation includes: \(i\) a natural extension is applying the pipeline to the full training splits ofFOLIOandMALLS, where exhaustive manual review is prohibitive but systematic quality improvements would have broad impact: \(ii\) increasing pipeline autonomy through iterative refinement loops and automated ontology construction would reduce the remaining manual effort, \(iii\) introducing a dedicated pre\-processing stage to detect and resolve linguistic ambiguity prior to formalization is a promising direction toward a more robust curation process\.

## 7Limitations

Dataset curation is inherently delicate and error\-prone\. To mitigate this risk, all instances were independently annotated by one expert and then cross\-checked by a second: each annotator reviewed the other’s half of the dataset, and disagreements were resolved by discussion\. Residual annotation mistakes may nonetheless remain\. Annotators corrected instances, without access to the outputs of the models evaluated in Section[2\.3](https://arxiv.org/html/2606.02837#S2.SS3)and Section[5](https://arxiv.org/html/2606.02837#S5), ruling out any risk of bias toward the reasoning patterns of a specific tested model\.

A further limitation concerns ambiguity: the criteria used to flag ambiguous sentences are not standardized by prior work, and alternative strategies may be equally defensible\. Finally, our conclusions hold consistently across three datasets and three LLMs, but may not generalize to other logical formalisms, domains, or model families\.

## References

## Appendix AAutoformalization in FOL: Background

#### First\-Order Logic\.

A*signature*σ=\(𝒱,𝒞,ℱ,ℛ\)\\sigma=\(\\operatorname\{\\mathcal\{V\}\},\\operatorname\{\\mathcal\{C\}\},\\operatorname\{\\mathcal\{F\}\},\\operatorname\{\\mathcal\{R\}\}\)consists of a countably infinite set of variables𝒱\\operatorname\{\\mathcal\{V\}\}, a set of constants𝒞\\operatorname\{\\mathcal\{C\}\}, a set of function symbolsℱ\\operatorname\{\\mathcal\{F\}\}, and a set of relation/predicate symbolsℛ\\operatorname\{\\mathcal\{R\}\}\(each with an associated arity\), where all four sets are pairwise disjoint\.*Terms*are built inductively: variables and constants are base terms; ift1,…,tnt\_\{1\},\\dots,t\_\{n\}are terms andf∈ℱf\\in\\operatorname\{\\mathcal\{F\}\}has aritynn, thenf​\(t1,…,tn\)f\(t\_\{1\},\\dots,t\_\{n\}\)is a term\.*FOL formulas*overσ\\sigmaare defined inductively by:

φ:=\\displaystyle\\varphi\\ :=\\r\(t1,…,tn\)\|¬φ\|φ∧φ\|φ∨φ\|φ⊕φ\|\\displaystyle r\(t\_\{1\},\\dots,t\_\{n\}\)\\ \|\\ \\neg\\varphi\\ \|\\ \\varphi\\land\\varphi\\ \|\\ \\varphi\\lor\\varphi\\ \|\\ \\varphi\\oplus\\varphi\\ \|φ→φ\|φ↔φ​\|∃x​φ\|​∀x​φ,\\displaystyle\\varphi\\to\\varphi\\ \|\\ \\varphi\\leftrightarrow\\varphi\\ \|\\ \\exists x\\,\\varphi\\ \|\\ \\forall x\\,\\varphi,wherer∈ℛr\\in\\operatorname\{\\mathcal\{R\}\}with aritynn, thetit\_\{i\}are terms, and¬,∧,∨,⊕,→,↔,∃,∀\\neg,\\land,\\lor,\\oplus,\\to,\\leftrightarrow,\\exists,\\foralldenote negation, conjunction, inclusive disjunction, exclusive disjunction, implication, equivalence, and the existential and universal quantifiers respectively\.

#### Operator precedence\.

When parentheses are omitted, the logical operators above are assumed to bind in the following order \(from highest to lowest precedence\):

\{¬,∃,∀\}≻∧≻∨≻→≻↔≻⊕\.\\\{\\neg,\\,\\exists,\\,\\forall\\\}\\;\\succ\\;\\land\\;\\succ\\;\\lor\\;\\succ\\;\\to\\;\\succ\\;\\leftrightarrow\\;\\succ\\oplus\.For example,¬A∧B→C∨D\\neg A\\land B\\to C\\lor Dis parsed as\(\(¬A\)∧B\)→\(C∨D\)\(\(\\neg A\)\\land B\)\\to\(C\\lor D\)\. Parentheses are required explicitly whenever the intended scoping departs from this convention\.

#### Ontology\.

Evaluating whether a FOL formula faithfully represents a natural language sentence requires an interpretation of the symbols in the formula\. We use the term*ontology*to refer to the signature enriched with the intended NL meaning of each symbol \(e\.g\.,Tet​\(x\)\\text\{Tet\}\(x\)means “xxis a tetrahedron”\)\. We denote this usually with the symbolΩ\\Omega\.

#### The autoformalization task\.

Autoformalization is the task of producing a FOL representation that symbolically captures the semantic meaning of a NL phrasepp\. Followingbrunello2026llms, the task decomposes along two dimensions:

- •Ontology Extraction \(OE\):identifying a suitable signature for the formalization, along with the interpretation of the symbols in it\.
- •Logical Translation \(LT\):translatingppinto a FOL formula using the symbols and meanings provided by a given ontology\.

This decomposition arises from the fact that determining whether a given formula is equivalent to another is only possible when the two formulas share the same signature\. Therefore, in this paper, whenever we ask a model to produce a formalization, we also provide the signature as part of the input \(LT task\)\. This allows us to reliably evaluate performance on the formalization task automatically\.

As noted inbrunello2026llms, without a labeled signature it is also impossible to manually assess whether a formula conveys the same meaning as a natural language sentence, since the interpretation of the symbols used in the formula would remain unclear\.

## Appendix BDataset Error Rates are Coherent with Recent Literature

At first glance, the high percentage of incorrect formalizations we identify in theFOLIO\_validationdataset \(38\.9%\) may seem to contradict recent findings in the literature\. For instance,\(brunello2026llms\)demonstrated that state\-of\-the\-art models like o3\-mini can achieve approximately 80% accuracy in NL to FOL translation tasks on the train split of FOLIO dataset\.

#### Sentence\-Level vs\. Mixed\-Instance Evaluation\.

The discrepancy stems from the fundamental unit of analysis\. Evaluations that report∼\\sim80% accuracy assess*isolated sentences*translated into standalone FOL formulas\. In contrast, our quality assessment of FOLIO evaluates the dataset according to its actual NLI structure\.

TheFOLIO\_validationsplit contains 275 total instances, divided into two distinct types:

1. 1\.Stories \(73 instances\):The concatenated premises of an NLI problem, averaging roughly 6 natural language sentences each\.
2. 2\.Conclusions \(202 instances\):Single\-sentence logical conclusions\.

In the context of automated reasoning, a story’s formalization is only correct if*every single component sentence*is flawlessly translated and ontologically consistent\. The reason behind the choice to consider the story as a whole is that some information is not explicit in specific sentences and derives from the context posed by the premises in the story: thus, if we have in mind general applications of the autoformalization task, ours seems the most natural choice\.

#### Mathematical Justification\.

We can demonstrate that our 38\.9% instance\-level error rate is entirely consistent with∼\\sim80% sentence\-level accuracy as reported in the literature\.

TheFOLIO\_validationdataset contains 275 instances composed of 73 multi\-sentence stories \(averaging 6 sentences each\) and 202 single\-sentence conclusions, yielding approximately73×6\+202=64073\\times 6\+202=640total sentences\. Supposing that any error done by the model in FOLIO stories are located in only one sentence, based on our error rate, we get38\.9%×275∼10738\.9\\%\\times 275\\sim 107incorrect sentences\. A perfect annotator then would have produced an accuracy of:

640−107640∼83%\\frac\{640\-107\}\{640\}\\sim 83\\%which is consistent with the∼\\sim80% accuracy reported by\(brunello2026llms\)when evaluating state\-of\-the\-art models on isolated FOLIO sentences\.

We also notice that the dataset used for the evaluation inbrunello2026llmsis different from the one used here \(FOLIO\_trainingvs\.FOLIO\_validation\) even if they are random splits coming from the same pool\.

## Appendix CInstance of The Ontology File

We report here an example of the entries of the ontology file given whithin the code\. The following JSON is what can be found for the FOLIO instancestory\_380\. Each entry inRelcorresponds to a predicate symbol extracted from the curated FOL formula, annotated with its arity \(arity\) and NL meanings \(i\.e\., its NL interpretation \(pos\_meaning\), and the meaning when the predicate is negated \(neg\_meaning\)\); argument positions are indicated by\{0\},\{1\}, etc\. Each entry inConstmaps a logical constant name to its NL interpretation as it appears in the following example\.

For FOLIO, each ontology is described by thestory\_id: for each premises in the multi\-sentence story and for each conclusion related to the story, the formalization in FOL provided in the curated dataset adhere to the symbols here listes\. For MALLS, each sentence comes with an own ontology: hence, we will have 73 entries for FOLIO and 100 for MALLS\.

\{

"story\_id":380,

"Rel":\{

"InThisClub":\{"arity":1,"pos\_meaning":"\{0\}isinthisclub",

"neg\_meaning":"\{0\}isnotinthisclub"\},

"PerformOftenIn":\{"arity":2,"pos\_meaning":"\{0\}performsoftenin\{1\}",

"neg\_meaning":"\{0\}doesnotperformoftenin\{1\}"\},

"Attend":\{"arity":2,"pos\_meaning":"\{0\}attends\{1\}",

"neg\_meaning":"\{0\}doesnotattend\{1\}"\},

"VeryEngagedWith":\{"arity":2,"pos\_meaning":"\{0\}isveryengagedwith\{1\}",

"neg\_meaning":"\{0\}isnotveryengagedwith\{1\}"\},

"InActive":\{"arity":1,"pos\_meaning":"\{0\}isinactive",

"neg\_meaning":"\{0\}isnotinactive"\},

"Disinterested":\{"arity":1,"pos\_meaning":"\{0\}isdisinterested",

"neg\_meaning":"\{0\}isnotdisinterested"\},

"MemberOf":\{"arity":2,"pos\_meaning":"\{0\}isamemberof\{1\}",

"neg\_meaning":"\{0\}isnotamemberof\{1\}"\},

"Chaperone":\{"arity":2,"pos\_meaning":"\{0\}chaperones\{1\}",

"neg\_meaning":"\{0\}doesnotchaperone\{1\}"\},

"Student":\{"arity":1,"pos\_meaning":"\{0\}isastudent",

"neg\_meaning":"\{0\}isnotastudent"\},

"AttendSchool":\{"arity":1,"pos\_meaning":"\{0\}attendsschool",

"neg\_meaning":"\{0\}doesnotattendschool"\},

"YoungChild":\{"arity":1,"pos\_meaning":"\{0\}isayoungchild",

"neg\_meaning":"\{0\}isnotayoungchild"\},

"Teenager":\{"arity":1,"pos\_meaning":"\{0\}isateenager",

"neg\_meaning":"\{0\}isnotateenager"\},

"WishToFurther":\{"arity":2,"pos\_meaning":"\{0\}wishestofurther\{1\}",

"neg\_meaning":"\{0\}doesnotwishtofurther\{1\}"\}

\},

"Const":\{

"bonnie":"Bonnie",

"schoolTalentShow":"theschooltalentshow",

"schoolEvent":"schoolevents",

"community":"thecommunity",

"highSchoolDance":"highschooldances",

"academicCareer":"academiccareerandeducationalopportunities",

"educationalOpportunity":"educationalopportunities"

\}

\}

## Appendix DReleased Data Format Of Human\-Annotated Subsets

We release the human\-annotated portions of FOLIO \(275 instances\) and MALLS \(100 instances\) as JSON records designed for reliable evaluations\.

#### FOLIO: story records and conclusion records\.

We store each FOLIO instance as a*story record*\(containing the full multi\-sentence NL premises \(story\-level context\) together with its FOL translation\), or as a*conclusion record*\(each record contains a conclusion in NL with its FOL counterpart\)\. An example is reported in Appendix[E](https://arxiv.org/html/2606.02837#A5)\.

#### Common fields\.

Both record types share:

- •id: unique identifier \(e\.g\.,story\_471,concl\_471\_20\)\.666For the story records, the last number identifies the story ID in the original dataset\. For the conclusion records, the first number identifies the story ID to which the conclusion is linked, while the second number identifies the row in which the conclusion appears in the original version of the dataset hosted on Hugging Face, available at[https://huggingface\.co/datasets/yale\-nlp/FOLIO/viewer/default/validation](https://huggingface.co/datasets/yale-nlp/FOLIO/viewer/default/validation)\.
- •NL\_sentence: the NL text\.
- •FOL\_sentence\_old: original formula from the source dataset\.
- •FOL\_sentence: curated formula \(equal toFOL\_sentence\_oldif unchanged\)\.
- •corrected: whether we corrected the original formula\.

#### Ambiguity metadata\.

If the NL admits multiple plausible readings, we set:

- •ambiguity: boolean\.
- •ambiguity\_explanation: categorical tagsome\_plural,either\_or,option\_partition,sufficient\_necessary\_condition,determinative\_the,existential\_universalwhen possible or otherwise an explanation of the ambiguity we detected\.

Even whenambiguity=truewe provide, in theFOL\_sentencefield, the most plausible alternative given the instructions detailed in Section Section[2\.2](https://arxiv.org/html/2606.02837#S2.SS2.SSS0.Px4)\.

#### Correction explanation\.

Whencorrected=true, we include:

- •correction\_explanation: short description of what was changed and why\.

#### NLI labels \(conclusion records only\)\.

Conclusion records additionally include:

- •label\_old: the original FOLIO NLI label\.
- •label\_old\_Z3: the NLI label derivable from the original formalizations
- •label: our corrected label \(coherent withlabel\_Z3\)
- •label\_Z3: the NLI label derivable from the new formalizations \(plus eventually the information in theNLI\_requires\_more\_info\.
- •NLI\_requires\_more\_info: the FOL formula that represents the background knowledge that must be added to the story to ensure that the NLI label is consistent with the FOL entailment\.

#### MALLS: item records\.

MALLS uses a single\-item format, so each instance corresponds to a unique sentence always\. Each record includesid,NL\_sentence,FOL\_sentence\_old,FOL\_sentence\_new,corrected, and \(when applicable\)ambiguitywithambiguity\_explanationas explained above\. You can find an example in the Appendix[F](https://arxiv.org/html/2606.02837#A6)\.

#### Ontology\.

Each FOLIO story with conclusions and MALLS instance is further accompanied by an*ontology*\(signature enriched with LLM\-generated glosses\): see Appendix[C](https://arxiv.org/html/2606.02837#A3)\.

## Appendix EExample of ReleasedFOLIO\_validation\_curatedDataset

Story record\.

\{

"id":"story\_380",

"NL\_sentence":"Peopleinthisclubwhoperforminschooltalentshowsoftenattendandareveryengagedwithschoolevents\.Peopleinthisclubeitherperforminschooltalentshowsoftenorareinactiveanddisinterestedcommunitymembers\.Peopleinthisclubwhochaperonehighschooldancesarenotstudentswhoattendtheschool\.Allpeopleinthisclubwhoareinactiveanddisinterestedmembersoftheircommunitychaperonehighschooldances\.Allyoungchildrenandteenagersinthisclubwhowishtofurthertheiracademiccareersandeducationalopportunitiesarestudentswhoattendtheschool\.Bonnieisinthisclubandsheeitherbothattendsandisveryengagedwithschooleventsandisastudentwhoattendstheschoolorisnotsomeonewhobothattendsandisveryengagedwithschooleventsandisnotastudentwhoattendstheschool\.",

"FOL\_sentence":"∀\\forallx\(InThisClub\(x\)∧\\landPerformOftenIn\(x,schoolTalentShow\)→\\rightarrowAttend\(x,schoolEvent\)∧\\landVeryEngagedWith\(x,schoolEvent\)\)∧\\land∀\\forallx\(InThisClub\(x\)→\\rightarrowPerformOftenIn\(x,schoolTalentShow\)⊕\\oplus\(InActive\(x\)∧\\landDisinterested\(x\)∧\\landMemberOf\(x,community\)\)\)∧\\land∀\\forallx\(InThisClub\(x\)∧\\landChaperone\(x,highSchoolDance\)→\\rightarrow¬\\lnot\(Student\(x\)∧\\landAttendSchool\(x\)\)\)∧\\land∀\\forallx\(InThisClub\(x\)∧\\land\(InActive\(x\)∧\\landDisinterested\(x\)∧\\landMemberOf\(x,community\)\)→\\rightarrowChaperone\(x,highSchoolDance\)\)∧\\land∀\\forallx\(InThisClub\(x\)∧\\land\(YoungChild\(x\)∨\\lorTeenager\(x\)\)∧\\landWishToFurther\(x,academicCareer\)∧\\landWishToFurther\(x,educationalOpportunity\)→\\rightarrow\(Student\(x\)∧\\landAttendSchool\(x\)\)\)∧\\landInThisClub\(bonnie\)∧\\land\(\(Attend\(bonnie,schoolEvent\)∧\\landVeryEngagedWith\(bonnie,schoolEvent\)∧\\landStudent\(bonnie\)∧\\landAttendSchool\(bonnie\)\)⊕\\oplus\(¬\\lnot\(Attend\(bonnie,schoolEvent\)∧\\landVeryEngagedWith\(bonnie,schoolEvent\)\)∧\\land¬\\lnotStudent\(bonnie\)∧\\land¬\\lnotAttendSchool\(bonnie\)\)\)",

"corrected":"yes",

"ambiguity":true,

"ambiguity\_explanation":"either\_or",

"correction\_explanation":"1\.Spelling:Corrected'Studen'to'Student'and'bonne'to'bonnie'\.2\.Linkage:'highSchoolDances'changedto'highSchoolDance'inPremise4\.3\.MissingParenthesis&Operators:FixedthemissingopeningparenthesisfortheimplicationinPremise5\.Changed'YoungChildren'to'YoungChild'\.Addedmissing'educationalOpportunity'toWishToFurthertofullycapturetheNL\.",

"FOL\_sentence\_old":"∀\\forallx\(InThisClub\(x\)∧\\landPerformOftenIn\(x,schoolTalentShow\)→\\rightarrowAttend\(x,schoolEvent\)∧\\landVeryEngagedWith\(x,schoolEvent\)\)\\n∀\\forallx\(InThisClub\(x\)→\\rightarrowPerformOftenIn\(x,schoolTalentShow\)⊕\\oplus\(InActive\(x\)∧\\landDisinterested\(x\)∧\\landMemberOf\(x,community\)\)\)\\n∀\\forallx\(InThisClub\(x\)∧\\landChaperone\(x,highSchoolDance\)→\\rightarrow¬\\lnot\(Studen\(x\)∧\\landAttendSchool\(x\)\)\)\\n∀\\forallx\(InThisClub\(x\)∧\\land\(InActive\(x\)∧\\landDisinterested\(x\)∧\\landMemberOf\(x,community\)\)→\\rightarrowChaperone\(x,highSchoolDances\)\)\\n∀\\forallx\(InThisClub\(x\)∧\\land\(YoungChildren\(x\)⊕\\oplusTeenager\(x\)\)∧\\landWishToFurther\(x,academicCareer\)\)→\\rightarrowStuden\(x\)∧\\landAttendSchool\(x\)\)\\nInThisClub\(bonnie\)∧\\land¬\\lnot\(\(Attend\(x,schoolEvent\)∧\\landVeryEngagedWith\(bonnie,schoolEvent\)\)⊕\\oplus\(Studen\(bonne\)∧\\landAttendSchool\(bonnie\)\)\)"

\}

Conclusion record \(example\)\.

\{

"id":"concl\_380\_1",

"NL\_sentence":"Bonnieperformsinschooltalentshowsoften\.",

"FOL\_sentence":"PerformOftenIn\(bonnie,schoolTalentShow\)",

"label":"Uncertain",

"corrected":"yes",

"correction\_explanation":"1\.PredicateMatch:Corrected'Perform'to'PerformOftenIn'tomatchthepremises\.2\.RemovedExtraneousConjunction:Theoriginalincluded'InThisClub\(bonnie\)'whichisnotstatedintheconclusionNL\.",

"FOL\_sentence\_old":"InThisClub\(bonnie\)∧\\land\(Perform\(bonnie,schoolTalentShow\)\)",

"label\_old":"Uncertain"

\}

## Appendix FExample of ReleasedMALLS\_test\_curatedDataset Example

\{

"NL\_sentence":"Avacationisrelaxingifitincludesbeautifulsceneryandenjoyableactivities\.",

"FOL\_sentence\_old":"∀\\forallx\(Vacation\(x\)∧\\landRelaxing\(x\)→\\rightarrow\(BeautifulScenery\(x\)∧\\landEnjoyableActivities\(x\)\)\)",

"FOL\_sentence\_new":"∀\\forallx\(Vacation\(x\)→\\rightarrow\(\(BeautifulScenery\(x\)∧\\landEnjoyableActivities\(x\)\)→\\rightarrowRelaxing\(x\)\)\)",

"corrected":"yes",

"correction\_explanation":"1\.Theoriginalformulainvertedtheimplicationsymbols\.",

"id":1

\}

## Appendix GClassification of Common Ambiguities

We report a classification of the recurring NL ambiguities we detected during human oversight of FOLIO and MALLS\. Each type is described with an explanation, a concrete example showing the two plausible FOL readings, and the canonicalization convention we adopt\. This is not intended as a comprehensive taxonomy of NL ambiguity in general; it is an empirically grounded catalogue of the patterns that arise frequently in existing NL–FOL datasets for the autoformalization task\.

#### some\_plural\.

The wordsomefollowed by a plural noun is ambiguous between asserting the existence of at least one individual satisfying a property and asserting the existence of at least two distinct individuals\. For example,some pets are not mammalscan be formalized as∃x​\(Pet​\(x\)∧¬Mammal​\(x\)\)\\exists x\\,\(\\text\{Pet\}\(x\)\\land\\lnot\\text\{Mammal\}\(x\)\)or as∃x​∃y​\(Pet​\(x\)∧¬Mammal​\(x\)∧Pet​\(y\)∧¬Mammal​\(y\)∧x≠y\)\\exists x\\,\\exists y\\,\(\\text\{Pet\}\(x\)\\land\\lnot\\text\{Mammal\}\(x\)\\land\\text\{Pet\}\(y\)\\land\\lnot\\text\{Mammal\}\(y\)\\land x\\neq y\)\. We generally adopt the one\-witness reading \(first formula\) and flag the instance as ambiguous\. We decide to flag this form of ambiguity since FOLIO sometimes use the one\-witness formulation and others the two\-witnesses one\.

#### either\_or\.

The connectiveeither…oris ambiguous between inclusive disjunction \(∨\\lor\) and exclusive disjunction \(⊕\\oplus\)\. For example,Peter’s pets are all either monkeys or birdscan be read inclusively as∀x​\(PetOfPeter​\(x\)→\(Monkey​\(x\)∨Bird​\(x\)\)\)\\forall x\\,\(\\text\{PetOfPeter\}\(x\)\\rightarrow\(\\text\{Monkey\}\(x\)\\lor\\text\{Bird\}\(x\)\)\)or exclusively as∀x​\(PetOfPeter​\(x\)→\(Monkey​\(x\)⊕Bird​\(x\)\)\)\\forall x\\,\(\\text\{PetOfPeter\}\(x\)\\rightarrow\(\\text\{Monkey\}\(x\)\\oplus\\text\{Bird\}\(x\)\)\), where the latter rules out any pet being both a monkey and a bird\. We generally adopt the exclusive reading \(⊕\\oplus\) and flag the instance as ambiguous\.

#### option\_partition\.

Enumerated categorization \(e\.g\.,XXcan be categorized asAA,BB, andCC\) is ambiguous between an inclusive partition \(allowing overlap\) and an exclusive partition \(enforcing disjointness\)\. For example,machine learning algorithms can be categorized as supervised learning, unsupervised learning, and reinforcement learningcan be read inclusively as∀x\(MLA\(x\)↔\(SL\(x\)∨UL\(x\)∨RL\(x\)\)\)\\forall x\\,\(\\text\{MLA\}\(x\)\\leftrightarrow\(\\text\{SL\}\(x\)\\lor\\text\{UL\}\(x\)\\lor\\text\{RL\}\(x\)\)\), or exclusively by additionally enforcing pairwise disjointness\. We generally adopt the inclusive reading \(∨\\lor\) and flag the instance as ambiguous\.

#### sufficient\_necessary\_condition\.

Conditional phrasing such asXXifPPcan denote a sufficient condition \(P→XP\\rightarrow X\) or be intended as a biconditional \(X↔PX\\leftrightarrow P\) when the intended meaning of the NL sentence is to*define*a specific category of entities\. For example,An organization is non\-profit if it has a charitable mission, does not distribute profits to owners, and primarily relies on donationscan be interpreted as∀x​\(\(Organization​\(x\)∧CharitableMission​\(x\)∧¬DistributesProfits​\(x\)∧ReliesOnDonations​\(x\)\)→NonProfit​\(x\)\)\\forall x\\,\(\(\\text\{Organization\}\(x\)\\land\\text\{CharitableMission\}\(x\)\\land\\lnot\\text\{DistributesProfits\}\(x\)\\land\\text\{ReliesOnDonations\}\(x\)\)\\rightarrow\\text\{NonProfit\}\(x\)\), but a plausible alternative is a biconditional that also constrains*all*non\-profits to satisfy those properties\. We flag such cases as ambiguous and, when not clearly disambiguated by the context, adopt theP→XP\\rightarrow Xinterpretation\.

#### determinative\_the\.

The definite articlethepresupposes a unique, contextually salient referent, but in isolation it is often unclear whether the formalization should encode uniqueness explicitly\. For example,the cube is redcould be formalized as a simple existential∃x​\(Cube​\(x\)∧Red​\(x\)\)\\exists x\\,\(\\text\{Cube\}\(x\)\\land\\text\{Red\}\(x\)\), or with an explicit uniqueness constraint∃x​\(Cube​\(x\)∧Red​\(x\)∧∀y​\(Cube​\(y\)→y=x\)\)\\exists x\\,\(\\text\{Cube\}\(x\)\\land\\text\{Red\}\(x\)\\land\\forall y\\,\(\\text\{Cube\}\(y\)\\rightarrow y=x\)\), or even as a direct predication over a named constant when the referent has been introduced earlier in a story\. Without prior discourse context, the intended reading is underspecified\. We flag such cases as ambiguous and, when a single formalization is required, default to the simple existential unless story\-level context motivates the uniqueness reading\.

#### existential\_universal\.

Bare singular noun phrases of the formaCChas propertyPPare ambiguous between a universal reading \(everyCChasPP\) and an existential one \(some particularCChasPP\)\. For example,a child has a motheris most naturally read as a universal generalization∀x​\(Child​\(x\)→∃y​\(Mother​\(y,x\)\)\)\\forall x\\,\(\\text\{Child\}\(x\)\\rightarrow\\exists y\\,\(\\text\{Mother\}\(y,x\)\)\), yet without context, the same syntactic pattern could refer to a specific child\. We flag such cases as ambiguous and adopt the universal reading when the sentence expresses a general rule, falling back to the existential reading when the sentence clearly introduces a particular individual\.

## Appendix HExperimental Setup

This section specifies the experimental configuration for evaluating both pipelines\. We benchmark three LLMs using four prompting strategies \(B1–B4\) and four prompt variants \(pv1–pv4\)\. All combinations are evaluated onFOLIO\_validation,MALLS\_test, andGGC\. We describe the prompting strategies \(Section[H\.1](https://arxiv.org/html/2606.02837#A8.SS1)\), the prompt variants \(Section[H\.2](https://arxiv.org/html/2606.02837#A8.SS2)\), the models \(Section[H\.3](https://arxiv.org/html/2606.02837#A8.SS3)\), the data splits \(Section[H\.4](https://arxiv.org/html/2606.02837#A8.SS4)\), and how the logical equivalence check is performed \(Section[H\.5](https://arxiv.org/html/2606.02837#A8.SS5)\)\. All experimental artifacts are released in the accompanying code repository\.

### H\.1Prompting Strategies \(B1–B4\)

We benchmark LLMs using four prompting strategies\. As defined in Section[3\.1](https://arxiv.org/html/2606.02837#S3.SS1), the model receivespp,φ\\varphi, andΩ\\Omega, and must output a verdictvvalongside a refinementψ\\psi\.

B1: Zero\-Shot Chain\-of\-Thought\.\(wei2023chainofthought\)\. Task description and CoT instruction only; no in\-context examples\. The model reasons step\-by\-step over\(p,φ,Ω\)\(p,\\varphi,\\Omega\)before producing a verdict and refinement\. Establishes a floor for instruction\-following and self\-directed reasoning alone\.

B2: Few\-Shot \(no CoT\)\.\(llmfewshot\)\. Task description plus 5 labeled examples covering all verdict classes, each with its ontology\. No reasoning chain is requested\. Tests whether exposure to demonstrations alone suffices\.

B3: Few\-Shot \+ CoT\.Combines B2’s in\-context examples with B1’s CoT instruction\.

B4: Feature\-Based Decomposition\.A form of metapromptingzhang2024metapromptingaisystems\. Rather than a holistic judgement, the model evaluatesφ\\varphiagainst a checklist of four sub\-properties—*syntactic well\-formedness*,*ontological faithfulness*,*truth\-conditional equivalence*, and*ambiguity resolution*—before aggregating\. Tests whether decomposition improves precision on subtle cases\.

Table 3:Overview of the four benchmarked prompting strategies\.
### H\.2Prompt Variants \(pv1–pv4\)

For each prompting strategy B1–B4, we render four prompt variants that vary along two dimensions: the evaluator’s stance toward the candidate formula and the specificity of evaluation guidance\.

pv1–*neutral, minimal\.*A direct task definition positioning the model as an expert linguist\-logician\. No explicit guidance on skepticism or evaluation protocol; the model is expected to assess the formula on its own merits\.

pv2–*cautious, specified\.*Introduces explicit skepticism, instructing the model to “avoid false confidence” and to default to ‘?’ when uncertain\. Any doubt about expressiveness, interpretation, or logical equivalence should trigger the uncertain verdict rather than a binary yes/no judgment\.

pv3–*neutral, detailed\.*Similar to pv1 in stance, but instructs the model to produce a thorough analysis of the strengths and weaknesses of the candidate formalization before committing to a verdict, encouraging deeper reasoning prior to the final judgment\.

pv4–*adversarial, rigorous\.*Adopts the most skeptical stance: the model is instructed to assumeφ\\varphiis incorrect until an adversarial analysis fails to find any flaw\. The model systematically probes for common error types \(wrong quantifier, scope inversion, predicate mismatch\) and assignsyesonly after such a search yields no flaw\.

All four variants maintain identical task definitions and examples \(where applicable\) across the three models, ensuring that cross\-model comparisons isolate model capability from prompt engineering\. The full prompt texts are released with the code\.

### H\.3Models

We benchmark three LLMs spanning a closed\-weights commercial system, an open\-weights mixture\-of\-experts model, and an open\-weights dense model:

GPT\-4o\-miniOpenAI’s small frontier model, accessed through the OpenAI API \(versiongpt\-4o\-mini\-2024\-07\-18\)\. Training on API inputs was disabled to prevent the model from being updated with data from the private GGC dataset\.

Qwen3\-30B\-A3B\.Alibaba’s 30B\-parameter mixture\-of\-experts model with 3B active parameters per token, run locally\. Extended chain\-of\-thought reasoning \(thinking mode\) was enabled\.

Gemma 4 31B\-it\.Google’s 31B\-parameter instruction\-tuned dense model, run locally\. Extended chain\-of\-thought reasoning \(thinking mode\) was enabled\.

For GPT\-4o\-mini and Qwen3\-30B\-A3B, the maximum output token limit was set to 2,500, with retries capped at 8,000 tokens\. For Gemma 4, the limit was increased to 8,000 tokens – and retries to 13,000 – to accomodate the longer reasoning traces produced during the thinking phase\.

We performed the local experiments on a Dell PowerEdge R750 running Red Hat Enterprise Linux 8\.7 \(Ootpa\)\. The system is equipped with two Intel\(R\) Xeon\(R\) Platinum 8360Y CPUs at 2\.40 GHz \(72 physical cores in total\), 512 GB of RAM, 4 TB of disk storage, and an NVIDIA A100 GPU with 80 GB of VRAM\.

### H\.4Data Splits

Both pipelines are evaluated on the three curated datasets: FOLIO\_validation\(275 instances\), MALLS\_test\(100 instances\), and the GGC dataset \(213 instances\)\.

### H\.5Logical Equivalence and SMT solver

All logical equivalence checks between FOL formulas are performed using the Z3 SMT solver\(de2008z3\), with a 30\-second wall\-clock timeout per query\. This configuration is used uniformly for the re\-evaluation of LLM performance on the curated datasets, the pipelines evaluation, and NLI label verification\. Queries that exceed the timeout are mapped to anSolverTimeoutErrorand excluded from metric computation\.

Z3 is invoked under two configurations depending on the dataset:

Raw \(FOLIO and MALLS\)\.No additional axioms; equivalence is decided as a pure first\-order question over the formulas themselves\.

With axioms \(GGC\)\.Since GGC formalizes objects in Tarski’s World, the relational symbols used are governed by domain constraints\. 41 axioms of the domain \(symmetry of converses such asSmaller/Larger, the symmetry and reflexivity ofSameSize,SameShape,SameCol,SameRow, the mutual exclusivity of shape and size classes, and statements like\(Larger​\(x,y\)∨Larger​\(y,x\)\)↔¬SameSize​\(x,y\)\(\\textsf\{Larger\}\(x,y\)\\vee\\textsf\{Larger\}\(y,x\)\)\\leftrightarrow\\neg\\textsf\{SameSize\}\(x,y\)\) are added as background theory\. This axiom set is released with the code\.

Timeout errors arise exclusively on GGC, likely due to the added axioms increasing solver complexity; they affect only 0\.4% of GGC instances\. Notice that timeout errors have no effect for the results discussed in Section[2\.3](https://arxiv.org/html/2606.02837#S2.SS3)\.

## Appendix IJustification of the Human\-Review Prioritisation Order

![Refer to caption](https://arxiv.org/html/2606.02837v1/x3.png)Figure 4:Accuracy conditioned on the verdict assigned by the judge, for Pipeline 1 \(blue\) and Pipeline 2 \(red\)\. Each bar pools all eligible variants \(B1–B4×\\timespv1–pv4 for P1; self\-judge B1–B4 for P2\) within each cell, yielding large counts and narrow confidence intervals Error bars are 95% Wilson confidence intervals\.GGCcells are shaded: because the original dataset is effectively error\-free, every Pipeline 1 instance labelled*yes*by the judge corresponds to a correct formula, making the conditioned accuracy equal to1\.001\.00\. Unparseable outputs \(total pooledn=243n=243\) are excluded from the bars because individual cells contain too few observations for reliable estimates\.### I\.1Motivation

Each pipeline does not itself constitute a curation strategy; however, the verdictvvcarries information that can be exploited to define a principled inspection order, directing human reviewers toward the instances most likely to require correction\. The rationale rests on a key empirical claim: that accuracy varies systematically with the verdict produced by the V&R stage\. Figure[4](https://arxiv.org/html/2606.02837#A9.F4)provides the evidence for this claim\.

### I\.2Accuracy by Verdict Group

#### yes\.

Instances assigned verdict*yes*achieve the highestψ^\\hat\{\\psi\}accuracy in every model–dataset–pipeline cell\. Averaged over all 9 cells, Pipeline 1 reaches0\.9330\.933and Pipeline 2 reaches0\.8190\.819\. Despite the gap between Pipeline 1 and Pipeline 2, even Pipeline 2’s*yes*group is substantially more reliable than the other verdict groups, validating the policy of deferring those instances to the last tier of human review\.

#### no\.

Averaged over all 9 cells, Pipeline 1 reaches0\.4970\.497and Pipeline 2 reaches0\.3970\.397\. These low values reflect two failure modes: \(i\) the judge correctly flags an erroneous formula and proposes a refinement that is itself incorrect, and \(ii\) the judge incorrectly flags a correct formula as*no*and proposes an unnecessary \(and potentially wrong\) refinement\. Both cases result inψ^≢φgold\\hat\{\\psi\}\\not\\equiv\\varphi\_\{\\mathrm\{gold\}\}and therefore require human review, confirming that*no*\-verdict instances should be prioritised\.

#### ?\.

Averaged over all 9 cells, Pipeline 1 reaches0\.2800\.280and Pipeline 2 reaches0\.2240\.224\. Crucially,?accuracy is lower thannoaccuracy in 8 out of 9 cells, empirically establishing that?instances should be reviewed beforenoinstances in the queue\.

#### Unparseable outputs\.

A small number of model outputs cannot be assigned to any verdict category because they do not match the expected structured format\. Across all 9 cells the total count of such*unparseable*responses is 243\. Pooling all cells, theψ^\\hat\{\\psi\}accuracy for unparseable outputs is131/243≈0\.54131/243\\approx 0\.54\.

This figure overstates reliability: when the judge produces an unparseable output, the pipeline falls back to the original formula \(φ\\varphiin Pipeline 1, andφ^\\hat\{\\varphi\}in Pipeline 2\) asψ^\\hat\{\\psi\}\(Section[3\.2](https://arxiv.org/html/2606.02837#S3.SS2)\), so this accuracy merely reflects how often the original formula happened to be correct\. Unparseable outputs provide no information about whether correction is needed and should be flagged for immediate review regardless of apparent accuracy\.

#### Summary of Prioritisation Order

Combining the observations above, the pipeline’s review order is:

1. 1\.Unparseable: outcome unknown; review unconditionally\.
2. 2\.?: lowest accuracy \(0\.22–0\.28\); high probability of error or genuine ambiguity requiring human adjudication\.
3. 3\.no:moderate accuracy \(0\.40–0\.50\); likely erroneous but partially addressed by the proposed refinement\.
4. 4\.yes:highest accuracy \(0\.82–0\.93\); the judge is confident and usually correct; review only if capacity allows\.

This ordering directly determines the shape of the human\-oversight curve: the steepest accuracy gains per reviewed fraction occur in the first tiers, so the area under the curve \(AUC\) rewards pipelines that concentrate errors early in the queue\.

## Appendix JPipeline Comparison:T90T\_\{90\},T95T\_\{95\}, and AAG

![Refer to caption](https://arxiv.org/html/2606.02837v1/x4.png)Figure 5:Pipeline comparison across models and datasets under theT90T\_\{90\}metric\.![Refer to caption](https://arxiv.org/html/2606.02837v1/x5.png)Figure 6:Pipeline comparison across models and datasets under theT95T\_\{95\}metric\.![Refer to caption](https://arxiv.org/html/2606.02837v1/x6.png)Figure 7:Pipeline comparison across models and datasets under the AAG metric\.The main paper reports AUC \(area under the curve\) as the primary scalar summary of pipeline quality\. Figure[5](https://arxiv.org/html/2606.02837#A10.F5)\-[6](https://arxiv.org/html/2606.02837#A10.F6)replicate the same 3×3 bar grid \(model × dataset\) for the other complementary metricsT90T\_\{90\}\(fraction of the dataset that must be reviewed to reach 90% accuracy\), andT95T\_\{95\}\(fraction to reach 95% accuracy\)\.

Figure[7](https://arxiv.org/html/2606.02837#A10.F7)complements the analysis with the additional*Area Above Green*\(AAG\) metric: the area of the region in the accuracy–human effort plane that lies between the Green Baseline curve and the curve of the strategy under evaluation\. By construction, the Green Baseline has AAG=0=0, so every positive value directly quantifies the net gain over starting the curation from the datasets obtained by the direct LLM translation\.

All statistical tests use one\-sided paired Wilcoxon signed\-rank matched on prompting strategies × prompt variants within each cell \(n=16n=16pairs\); no multiple\-comparison correction is applied within a cell\.

TheT90T\_\{90\}andT95T\_\{95\}bars \(Fig\.[5](https://arxiv.org/html/2606.02837#A10.F5)–[6](https://arxiv.org/html/2606.02837#A10.F6)\) confirm the AUC picture from a different angle: Pipeline 1 consistently requires reviewing a smaller fraction of the dataset to reach the accuracy thresholds, and the gains are statistically significant in almost all cells\.

## Appendix KModel Comparison on Pipeline 1

![Refer to caption](https://arxiv.org/html/2606.02837v1/x7.png)Figure 8:Model comparison across all three datasets and all four metrics \(AUC, AAG,T90T\_\{90\},T95T\_\{95\}\) for Pipeline 1\. Each bar represents the mean±\\pmstd over all 16 prompting combinations \(B1–B4×\\timespv1–pv4\)\. Brackets indicate pairwise Mann\-Whitney U tests with Bonferroni correction \(three pairs\)\.Figure[8](https://arxiv.org/html/2606.02837#A11.F8)supports the model comparison discussed in the main paper, showing results separately for all three datasets and including the AAG metric \(defined in Appendix[J](https://arxiv.org/html/2606.02837#A10)\)\. Each bar represents the mean±\\pmstd over all 16 prompting combinations \(B1–B4×\\timespv1–pv4\) of Pipeline 1\.

Gemma’s advantage over the other models is consistent acrossFOLIOandMALLS; onGGCthe gap narrows because all models operate near ceiling, as expected for an error\-free control dataset\.

The AAG results require careful interpretation\. OnMALLSandGGC, Gemma’s AAG is lower than that of the other models, which may seem inconsistent with its overall superiority\. The key is that the Green Baseline is model\-dependent: it is anchored to each model’s own direct\-translation accuracy\. AAG therefore measures how much a pipeline improves over that model’s own baseline, not over a common reference\. A stronger model has a higher Green Baseline to beat, so the same absolute accuracy gains translate into a smaller AAG\. This is a property of the metric, not a failure of the pipeline\.

## Appendix LVariant Selection and Sensitivity

The main paper restricts the Pipeline 1 analysis to the subset \{B1, B2, B3\}×\\times\{pv1, pv3\}, excluding B4 and pv2/pv4\. This appendix documents the statistical justification for that restriction and shows that no analogous restriction is warranted for Pipeline 2\.

![Refer to caption](https://arxiv.org/html/2606.02837v1/x8.png)Figure 9:Prompting strategies and variants comparison for Pipeline 1\. Blue\-framed cells mark the retained group \{B1, B2, B3\}×\\times\{pv1, pv3\}\.![Refer to caption](https://arxiv.org/html/2606.02837v1/x9.png)Figure 10:Prompting strategies and variants comparison for Pipeline 2Figures[9](https://arxiv.org/html/2606.02837#A12.F9)and[10](https://arxiv.org/html/2606.02837#A12.F10)show 4×\\times4 heatmaps \(prompting strategies×\\timesprompt variants\) of AUC,T90T\_\{90\}, andT95T\_\{95\}for Gemma on Pipeline 1 and Pipeline 2, respectively, averaged overFOLIOandMALLS\. In Figure[9](https://arxiv.org/html/2606.02837#A12.F9), the blue\-framed cells mark the retained group, which is statistically equivalent within itself and superior to the excluded configurations \(B4 and pv2/pv4\); higher \(lighter\) cell values indicate worse performance, confirming that the statistical finding matches the visual pattern\.

We use a two\-stage non\-parametric approach\. In stage 1, a Friedman test \(non\-parametric repeated\-measures ANOVA\) tests the global null that all levels of a factor are exchangeable\. The blocking structure is: for the prompt strategy factor, each block is one \(dataset × prompt variant\) combination \(8 blocks, 4 levels\); for the prompt\-version factor, each block is one \(dataset × prompting strategy\) combination \(8 blocks, 4 levels\)\. In stage 2, all\(42\)=6\\binom\{4\}\{2\}=6pairs are tested with two\-sided paired Wilcoxon signed\-rank; p\-values are Bonferroni\-corrected by multiplying by 6\.

#### Pipeline 1\.

The Friedman test is significant for both the prompting strategy factor \(AUC:χ2=14\.55\\chi^\{2\}=14\.55,p=0\.002p=0\.002;T90T\_\{90\}:p=0\.001p=0\.001;T95T\_\{95\}:p=0\.002p=0\.002\) and the prompt variant factor \(AUC:χ2=11\.25\\chi^\{2\}=11\.25,p=0\.010p=0\.010;T90T\_\{90\}:p<0\.001p<0\.001;T95T\_\{95\}:p=0\.003p=0\.003\)\. Post\-hoc tests identify B4 as significantly inferior to B2 and B3 \(pbonf≤0\.047p\_\{\\text\{bonf\}\}\\leq 0\.047\), and pv2 and pv4 as significantly inferior to pv1 \(pbonf≤0\.047p\_\{\\text\{bonf\}\}\\leq 0\.047\)\. The pairs B1–B2, B1–B3, B2–B3 and the pairs pv1–pv3 are all non\-significant, justifying retaining all three and both prompt variants in the best\-combination group\.

#### Pipeline 2\.

The Friedman test is non\-significant for both the prompting strategy factor \(χ2=1\.85\\chi^\{2\}=1\.85,p=0\.583p=0\.583\) and the prompt variant factor \(χ2=0\.15\\chi^\{2\}=0\.15,p=0\.985p=0\.985\), with all post\-hoc pairs also non\-significant\. This indicates that the performance landscape for Pipeline 2 is essentially flat across prompting combinations: no restriction is statistically warranted\. For the comparison in the main paper \(Figure[3](https://arxiv.org/html/2606.02837#S5.F3)\), we therefore do not apply the \{B1, B2, B3\}×\\times\{pv1, pv3\} restriction to Pipeline 2; instead, we select the configuration with the highest AUC, which provides the strongest possible case for Pipeline 2 and makes the superiority of Pipeline 1 all the more notable\.

## Appendix MOversight Band Grid and Best Variant Results

![Refer to caption](https://arxiv.org/html/2606.02837v1/x10.png)Figure 11:Extension of Figure[3](https://arxiv.org/html/2606.02837#S5.F3)to all the models and datasets\. Each band shows mean±\\pmstd over the 16 variants for Pipeline 1 \(blue\), and for Pipeline 2 \(red\); the green line represents the Green Baseline\.Figure[11](https://arxiv.org/html/2606.02837#A13.F11)shows the full 3×\\times3 grid of oversight curves \(model×\\timesdataset\) for Pipeline 1 \(blue band\), Pipeline 2 \(red band\), and the direct\-translation Green Baseline \(green line\)\. The figure generalizes the Gemma\-only plot from the main paper \(Fig\.[3](https://arxiv.org/html/2606.02837#S5.F3)\) and supports three observations:

- •Ordering is universal\.The Pipeline 1\>\>Pipeline 2\>\>Green Baseline ordering \(by AUC\) visible for Gemma holds for GPT\-4o\-mini and Qwen as well on FOLIO and MALLS\. The relative gap between Pipeline 1 and Pipeline 2 is largest for Gemma\.
- •GGCis near\-ceiling for all models\.All three models with Pipeline 1 converge to≥\\geq0\.95 accuracy almost immediately\. The oversight gain is real but compressed, confirmingGGC’s role as an error\-free control\.

For completeness, Table[4](https://arxiv.org/html/2606.02837#A13.T4)reports, for each model, dataset, and pipeline, the two prompt configurations with the highest AUC, together with their scores under all metrics considered \(AUC,T90T\_\{90\},T95T\_\{95\}\)\. For comparison, we report also for each model and dataset the scores for the Black Baseline and the Green Baseline\.

Table 4:Table representing the AUC,T90T\_\{90\}, andT95T\_\{95\}of the best 2 variants \(according to AUC\) for each model, dataset and pipeline

Similar Articles

Auditing LLM Benchmarks with Item Response Theory

arXiv cs.CL

This paper introduces an Item Response Theory-based method to detect mislabeled examples in LLM benchmarks at 95% precision, tracing errors to labeling heuristics and annotation issues.

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Hugging Face Daily Papers

This paper investigates the alignment of LLM-generated reviews with human judgment using 1k real ACL 2025 submissions, finding limited agreement, instability across models/prompts, and a method to artificially inflate scores without meaningful changes. The authors advise against relying solely on LLM reviews and call for discussion on their use in handling increasing submission volumes.