Does My Embedding Reflect That $A = B$? Evaluating Mathematical Equivalence in Embedding Models
Summary
This paper introduces the MELD dataset for evaluating whether text embedding models capture mathematical equivalence across different terminologies, and finds that current models fail. It proposes a contrastive learning approach to align informal and formal mathematical statements, improving retrieval on both informal-formal and natural language tasks.
View Cached Full Text
Cached at: 06/24/26, 07:44 AM
# Does My Embedding Reflect That 𝐴=𝐵? Evaluating Mathematical Equivalence in Embedding Models
Source: [https://arxiv.org/html/2606.23959](https://arxiv.org/html/2606.23959)
\\theorembodyfont\\theoremheaderfont\\theorempostheader
:\\theoremsep \\jmlrvolume334\\jmlryear2026\\jmlrworkshopTopology, Algebra, and Geometry in Data Science
\\NameJiaying Ye\\nametag\\Emailyjy0509@uw\.edu \\NameSamarth Rao\\nametag∗ \\NameLeo Carlin\\nametag∗ \\NameKedar Chintalapati \\NameSaharsh Bhargava \\NameRachit Jaiswal \\NameMichael Zhou \\NameJared Darlington \\NameJarod Alper \\addrUniversity of Washington\\NameVasily Ilin \\addrAxiom MathUniversity of Washington\\NameHenry Kvinge\\Emailhjk3@uw\.edu \\addrUniversity of WashingtonPacific Northwest National Laboratory
###### Abstract
Because mathematics is highly abstract, a single statement can take very different forms depending on what subfield it is framed in\. There are many examples where breakthroughs occurred after researchers discovered that a question had already been answered in a different field\. At the same time, the growth of new resources related to formalization has increased the need for tools that enable efficient and reliable navigation between mathematical ‘languages’ \(e\.g\., from Lean to natural language\)\. In this paper, we investigate whether current embedding models capture mathematical equivalence\. To do this, we introduce the*Mathematically Equivalent but Lexically Different Pairs \(MELD\) Dataset*, a collection of mathematically equivalent statements that are expressed in very different language\. We show that current state\-of\-the\-art embedding models tend to group statements by the terminology used to make them instead of the underlying math\. Motivated by this, we propose a contrastive approach to learning embeddings of mathematical text that focuses on aligning informal statements with different formalizations\. Our experiments demonstrate that this leads to improvements not only on informal\-formal retrieval tasks but also on MELD, which only contains natural language statements\.
###### keywords:
AI for math, text embeddings, Lean
## 1Introduction
Language models have become increasingly capable of research\-level mathematics with several documented cases of LLMs like GPT\(tao2024primetime;gowers2026chatgpt;openai2026unitdistance\), Gemini\(feng2026semi\), and Claude\(knuth2024news\)helping mathematicians make progress on open problems\. One of the advantages of these systems is their ability to draw on the vast mathematics literature \(through both parametric knowledge and internet search\) at a scale that is inaccessible to humans\. This includes making non\-trivial connections that extend beyond surface\-level lexical overlap\.
At the same time, the ability to effectively navigate large collections of mathematical text is becoming essential not only because of the ever\-growing number of mathematics papers available in online databases like the arXiv, but also because of new formalization\-driven resources likemathlib\(mathlib2020\)\. Although state\-of\-the\-art LLMs may already have the capability to perform this type of navigation, they are too expensive computationally to scale to many applications\. On the other hand, text embedding models are explicitly designed to capture semantic relationships within a corpus of text efficiently\. In this paper, we ask whether current state\-of\-the\-art embedding models can capture mathematical equivalence\.
To test this, we introduce a new dataset,*Mathematically Equivalent Lexically Different Pairs \(MELD\)*, a collection of 270 pairs of mathematically equivalent statements expressed in different terminology\. One simple example is the definition of a product of two sets in terms of set theory versus category theory\. We then show that state\-of\-the\-art embedding models tend not to embed these pairs near one another, instead grouping statements based on the mathematical subfield they are framed in\.
An embedding model that could group statements based on mathematical equivalence rather than lexical similarity would be a powerful tool for mathematical discovery\. Motivated by this, we propose a contrastive approach to training embedding models for mathematics that uses formal and informal versions of a statement as different “views” of the same underlying mathematics\. To realize this, we generate an augmented version of the`mathlib\_informal\_v4\.16\.0`dataset\(frenzymath2025mathlibinformal\), which contains Lean statements from themathliblibrary along with informal descriptions\. We also add rephrasings of the informal descriptions\. The two models we train under this setup,`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`, achieve strong results across a range of mathematics retrieval tasks111Code can be found at[https://github\.com/uw\-math\-ai/math2vec](https://github.com/uw-math-ai/math2vec)\.\. Supporting our hypothesis that our approach to training encourages a model to learn representations that reflect the mathematical content of a statement, we show that both models achieve the highest scores on MELD from among the models that we evaluated\. This is particularly notable since MELD only contains natural language content\.
In summary, our contributions include:
1. 1\.We introduce*Mathematically Equivalent but Lexically Different Pairs \(MELD\)*, a dataset with 270 pairs of mathematical statements that are lexically dissimilar but mathematically equivalent\.
2. 2\.Observing that most embedding models prioritize language similarity over mathematical equivalence, we propose a contrastive approach to developing embedding models for mathematics, using informal and formal representations of a statement as different views of the underlying mathematical content\.
3. 3\.We show that this leads to significant improvements on MELD and a range of other mathematical retrieval tasks\.
## 2Related work
The value of being able to efficiently leverage the mathematics literature has meant that retrieval and search are an active area of research in the AI\-for\-math community\. For example,\(welleck2021naturalproofs\)introduced a dataset of mathematical theorems, treating theorem search as a reference retrieval task\. This showed that text embedding models can be effective when retrieval only depends on capturing surface\-level similarities between mathematical statements but may be less effective when deeper mathematical connections are required\. More recently,\(alexander2026semantic\)used embeddings from Qwen3\-Embedding\-8B to perform retrieval over 9 million theorems\. On the other hand, our work aims to improve the capabilities of text embedding models themselves, focusing specifically on capturing mathematical equivalence\.
More recently\(ju2025mirb\), proposed a full text embedding benchmark for mathematical retrieval, Mathematics Information Retrieval Benchmark \(MIRB\)\. This benchmark includes tasks that mix less rigorous language \(e\.g\., search for MathStackExchange duplicates\) to more formal language \(e\.g\., Lean premise retrieval\)\. Our work differs from MIRB, as MELD is designed to target a specific embedding model capability rather than general mathematics retrieval\.
## 3Do Embedding Models Capture Mathematical Equivalence?
Figure 1:A UMAP visualization of embedded statements from the “vector spaces vs module theory” subset of MELD\. Blue points correspond to statements framed in terms of vector space terminology, brown points correspond to statements framed in terms of module theory\. Small triangles are distractor statements\. MELD pairs are large dots connected by lines\. Embedded statements cluster by subfield rather than mathematical equivalence\.One of the most attractive features of mathematics \(and also one of the features that can make mathematics so hard\) is that a single statement can often be described precisely in two radically different ways\. These “one thing, two perspectives” situations are the foundation of many fruitful areas of mathematics \(e\.g\., algebraic geometry, algebraic topology, and geometric group theory\)\.
Table 1:Subfield pairs used to generate MELDMotivated by this, we introduce*Mathematically Equivalent but Lexically Different Pairs \(MELD\)*222MELD can be found at[https://huggingface\.co/datasets/uw\-math\-ai/MELD\-dataset](https://huggingface.co/datasets/uw-math-ai/MELD-dataset), a collection of 270 statement pairs written in natural language and LaTeX where paired statements are equivalent but cannot be easily matched by surface\-level lexical similarity\. One example is the following:
> Identity map \(Set theory\)For a setAA, the mapidA:A→A\\mathrm\{id\}\_\{A\}:A\\to Adefined byidA\(a\)=a\\mathrm\{id\}\_\{A\}\(a\)=a\.
> Identity map \(Category theory\)For each objectAA, a distinguished arrowidA:A→A\\mathrm\{id\}\_\{A\}:A\\to Asuch thatf∘idA=ff\\circ\\mathrm\{id\}\_\{A\}=fandidA∘g=g\\mathrm\{id\}\_\{A\}\\circ g=gfor any objectsBBandCCandf:A→Bf:A\\rightarrow Bandg:C→Ag:C\\rightarrow A\.
The 270 pairs come from nine different pairs of domains that offer distinctive descriptions of a common set of ideas \(e\.g\., representation theory and Fourier analysis\)\. These are summarized in Table[1](https://arxiv.org/html/2606.23959#S3.T1)\.
Table 2:Retrieval recall of mathematically equivalent statements from MELD\. The best score or scores for each metric are written in bold\.We generated MELD by iterating through each of the nine pairs of complementary domains, describing the connection between the two fields, and prompting Claude Opus 4\.7 to generate 30 pairs of mathematically equivalent but lexically distinct statements\. We then manually reviewed these, making modifications to increase dissimilarity while preserving equivalence\. We then evaluated all statements using GPT\-5\.5 \(medium\) to check \(i\) whether both statements were valid, \(ii\) whether they were equivalent, and \(iii\) whether they could be made to sound less similar\.
While using LLM\-generated datasets carries risks, all content belongs to a basic graduate curriculum in mathematics, and hence should be well within the capabilities of current LLMs\. This also means that embedding model failures on MELD are less likely to be attributable to the subject matter being esoteric and more likely to point toward limitations in the way that these models represent mathematical text\.
To understand the extent to which embeddings of mathematical text are organized by mathematical equivalence rather than just lexical similarity, we embed all 540 statements using some of the best current embedding models and look at whether equivalent pairs are grouped near each other\. This includes`Qwen3\-Embedding\-4B`\(qwen3embedding\),`Qwen3\-Embedding\-8B`\(qwen3embedding\),`harrier\-oss\-v1\-27b`\(microsoft2026harrier\),`KaLM\-Embedding\-Gemma3\-12B`\(hu2025kalmembedding;zhao2025kalmembeddingv2\),`llama\-embed\-nemotron\-8b`\(babakhin2025llamaembednemotron8buniversaltextembedding\), and`Octen\-Embedding\-8B`\(octen2025rteb\)\.
We find that statements are mostly clustered by the field they are expressed in\. For instance, in Table[2](https://arxiv.org/html/2606.23959#S3.T2)the model that achieves the best results, Octen\-Embedding\-8B, has a recall@1 of only 25\.0\. To get a qualitative sense of why this is, in Figure[1](https://arxiv.org/html/2606.23959#S3.F1)we use UMAP\(mcinnes2018umap\)to reduce the dimensionality of the embedding and then visualize statements from 30 pairs where one element is written in the language of vector spaces \(blue\) and the other is written in the language of module theory \(brown\)\. Paired statements are connected by gray lines\. The model appears to strongly clusters statements by field rather than equivalence\.
## 4A Contrastive Approach to Embedding Mathematical Text
The experiments in Section[3](https://arxiv.org/html/2606.23959#S3)suggest that current embedding models capture the lexical similarity of mathematical statements rather than their mathematical equivalence\. Motivated by this, we consider training strategies that steer models toward embeddings that cluster statements by mathematical equivalence rather than surface\-level similarities in language\. While training on MELD has the potential to do this, it is small and there is not an easy way to cheaply scale it since it depends on state\-of\-the\-art LLMs\. Fortunately, the growth of formalization has begun to create new resources that address this need\.
mathlibis essential for formal proof development in Lean because it provides formalizations of central definitions and theorems from modern mathematics\.mathlibwas used to generate the`mathlib\_informal\_v4\.16\.0`dataset\(frenzymath2025mathlibinformal\)\. Each entry in this dataset corresponds to amathlibstatement, which comes in three different forms: the Lean signature, the Lean type, and a natural language summary\. We treat these as different “views” on a single underlying mathematical statement\. Our assumption is that a consequence of steering a model toward alignment of formal and informal versions of statements, will be embeddings that better capture mathematical equivalence\.
### 4\.1Training Our MathLeap Models
In addition to the natural language, Lean signature, and Lean type views inherited from`mathlib\_informal\_v4\.16\.0`, we also generate natural language rephrasings for some instances in the training set\. We do this by first using`Qwen3\.5\-9B`to decompose natural language theorems into explicit hypotheses and conclusions\. We then provide these to`gemma\-4\-E4B\-it`to generate rephrasings\. When doing this,`gemma\-4\-E4B\-it`is prompted to change superficial aspects of the theorem statement like variable names\. More details, including the prompts that were used, can be found in Section[B\.1](https://arxiv.org/html/2606.23959#A2.SS1)\. Our final training set contains 118,334 instances \(with three to four views each\) while our test set contains 15,287 instances333[https://huggingface\.co/datasets/uw\-math\-ai/Math2Vec\-embedding\-dataset](https://huggingface.co/datasets/uw-math-ai/Math2Vec-embedding-dataset)\. Approximately 85% of the training set had a natural language rephrasing attached to it\.
Since our initial experiments showed that`Qwen3\-Embedding\-8B`\(qwen3embedding\)and`Octen\-Embedding\-8B`\(octen2025rteb\)are the most effective existing models for mathematical retrieval tasks, we chose to use these as base models for finetuning\. We trained using Cached Multiple Negatives Ranking Loss from the Sentence\-BERT library\(reimers\-2019\-sentence\-bert\), a contrastive loss function that uses in\-batch negatives while avoiding common issues that arise when trying to scale contrastive methods to GPU training\. We used the default scale value of 20\.0\. We used AdamW, 700 warm\-up steps, a constant scheduler, a batch size of 16, and a max sequence length of 128\. Models were trained for approximately two epochs \(∼\\sim14,790 steps\)\. Training was performed on a single H200 and took approximately 8 hours\. We call the resulting models`MathLeap\-Qwen\-8B`444[https://huggingface\.co/uw\-math\-ai/MathLeap\-Qwen\-8B](https://huggingface.co/uw-math-ai/MathLeap-Qwen-8B)and`MathLeap\-Octen\-8B`555[https://huggingface\.co/uw\-math\-ai/MathLeap\-Octen\-8B](https://huggingface.co/uw-math-ai/MathLeap-Octen-8B)respectively\.
### 4\.2Evaluation
We evaluate`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`on a range of mathematical retrieval tasks\. Overall, we find that our contrastive approach to training a mathematics embedding model significantly improves performance on tasks where retrieving mathematically equivalent statements is critical\.
#### MELD
Finetuning`Qwen3\-Embedding\-8B`results in recall@1 increasing from 17\.0 to 27\.2 and recall@5 increasing from 47\.8 to 63\.1 \(Table[2](https://arxiv.org/html/2606.23959#S3.T2)\)\. At the same time, the average rank of the mathematically equivalent pair among all neighbors decreases from 16\.1 to 9\.7 \(Table[5](https://arxiv.org/html/2606.23959#A0.T5)\)\. Both`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`independently score higher on MELD than any of the other models we evaluated on each of the metrics we considered\. This is particularly notable since natural\-language\-only retrieval is a small part of the training curriculum of our models\. We see this as supporting our hypothesis that our contrastive style of training does not just help with the specific task of aligning representations of formal and informal mathematical statements, it also encourages representations that capture mathematical equivalence\.
Table 3:Retrieval scores on the blueprints dataset\. Retrieval was performed by giving the informal description and asking the model to find the corresponding Lean statement\. The best score or scores for each metric are written in bold\.
#### Blueprints
Lean blueprints are dependency graphs used to track progress on formalization projects\. Nodes in these graphs typically contain both a Lean statement and a natural language description\. We used these pairs across 68 graphs to create a dataset of 3,729 Lean and natural language pairs\. We measured recall on retrieval of the corresponding Lean statement given a natural language query\. Across all models that we evaluated,`MathLeap\-Qwen\-8B`or`MathLeap\-Octen\-8B`performed best across the evaluated recall metrics \(Table[5](https://arxiv.org/html/2606.23959#A0.T5)\)\. This is not as surprising as our results on MELD, since the blueprints task aligns closely with the training task for both MathLeap models\.
Table 4:nDCG@10 scores for three MIRB tasks\. The best score or scores for each metric are written in bold\.
#### MIRB
We evaluate`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`on three tasks from the MIRB benchmark: MathOverflow duplicate question retrieval,mathlibretrieval, and Lean premise retrieval \(originally from\(yang2023leandojo\)\)\. We measure performance using nDCG@10, following\(ju2025mirb\)\.`MathLeap\-Octen\-8B`performs best onmathlibretrieval, which is well\-aligned with its training set\. On the other hand, we see reduced performance on the other two tasks\. We analyze these trade\-offs in the Limitations section \(Section[5](https://arxiv.org/html/2606.23959#S5)\)\.
## 5Limitations
There are a number of limitations to this work that should be highlighted\. Most importantly, we note that on some tasks such as MathOverflow duplication retrieval and Lean premise retrieval, the performance of`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`decreases relative to their base models\. We suspect this is due to our models becoming over\-specialized during finetuning\. For example, MathOverflow duplication retrieval may require a broader natural\-language understanding outside of rigorous theorem statements, an ability that may degrade in our models during finetuning\. On the other hand, Lean premise retrieval involves using a proof\-state as a query andmathlibdeclarations as the corpus\. Our training set does not include any proof\-states so it is not surprising that performance degrades\. Overall, this points to the need for a more diverse training set if our goal is a generalist math embedding model\.
We also note that we depend on synthetic data for a number of our datasets\. MELD depends on frontier LLMs for identification and generation of basic theorems and definitions, as well as translation of these to the language of a different mathematical subfield\. The training set of`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`uses rephrasing generated by local models\. In both cases, we have attempted to remove any erroneous instances, but we cannot guarantee that all errors have been eliminated\.
## 6Conclusion
The ability of an embedding model to capture mathematical equivalence rather than just surface\-level textual similarity is important to mathematical discovery\. In this paper, we ask whether today’s state\-of\-the\-art embedding models are capable of retrieving mathematically equivalent statements from among lexically similar distractors\. Using a new dataset, Mathematically Equivalent but Lexically Different Pairs \(MELD\), we find that current models struggle to do this\. This leads us to propose a contrastive approach to mathematics\-specific embedding model training using informal and formal statements as different views of the underlying mathematics\. Using this approach, we train`MathLeap\-Qwen\-8B`and`MathLeap\-Octen\-8B`\. These models display substantial improvement across many mathematics retrieval tasks and suggest a path toward training even stronger mathematics specific embedding models in the future\.
## References
Table 5:MMR and mean pair rank for embedding models on MELD\. The best score or scores for each metric are written in bold\.## Appendix AFrenzyMath Retrieval Training and Evaluation
In this section we describe the training and testing details on our augmented version ofFrenzyMath/mathlib\_informal\_v4\.19\.0frenzymath2025mathlibinformal419\. We focus on mathematics\-related instances \(as opposed to those related to Lean infrastructure\), retaining examples whosemodule\_namefield has a second component in the following set:
- •Algebra,
- •AlgebraicGeometry,
- •AlgebraicTopology,
- •Analysis,
- •CategoryTheory,
- •Combinatorics,
- •Computability,
- •Condensed,
- •Control,
- •Data,
- •Dynamics,
- •FieldTheory,
- •Geometry,
- •GroupTheory,
- •InformationTheory,
- •LinearAlgebra,
- •Logic,
- •MeasureTheory,
- •ModelTheory,
- •NumberTheory,
- •Order,
- •Probability,
- •RepresentationTheory,
- •RingTheory,
- •SetTheory,
- •Topology\.
Our models are trained on an exact paired\-retrieval task\. During evaluation, queries are drawn from the held\-out test split but the retrieval corpus is drawn from the union of the train, validation, and test splits\. The model never sees validation or test examples during training\. A retrieved item is counted as correct only if it matches the row\-aligned target for the query\. All embeddings areL2L\_\{2\}\-normalized, and retrieval is performed by exact nearest\-neighbor search in the shared embedding space\.
Tables[6](https://arxiv.org/html/2606.23959#A1.T6)\-[11](https://arxiv.org/html/2606.23959#A1.T11)report six directed retrieval tasks: informal description to Lean type, Lean type to informal description, informal description to Lean signature, Lean signature to informal description, Lean type to Lean signature, and Lean signature to Lean type\. For instruction\-aware models, the task instruction is applied only to the query side, while the corpus side is embedded as the fixed candidate bank\. We report Recall@1 \(R@1\), Recall@5 \(R@5\), Recall@10 \(R@10\), and mean reciprocal rank \(MRR\) for the base embedding models used in the main experiments and our fine\-tuned variants,MathLeap\-Qwen\-8BandMathLeap\-Octen\-8B\.
Table 6:Informal description to Lean type\. The best score or scores for each metric are written in bold\.Table 7:Lean type to informal description\. The best score or scores for each metric are written in bold\.Table 8:Informal description to Lean signature\. The best score or scores for each metric are written in bold\.Table 9:Lean signature to informal description\. The best score or scores for each metric are written in bold\.Table 10:Lean type to Lean signature\. The best score or scores for each metric are written in bold\.Table 11:Lean signature to Lean type\. The best score or scores for each metric are written in bold\.
## Appendix BTraining data generation
### B\.1Prompts for theorem decomposition, rephrasing, and hard negative generation
In this section we provide the prompts that we used to generate the natural language rephrasings and hard negatives\. The first step in this process was to down select to theorems in`FrenzyMath/mathlib\_informal\_v4\.16\.0`\. We then had`Qwen/Qwen3\.5\-9B`decompose each of these into assumptions and conclusions using the following prompt:
You are a mathematical statement parser\.Task:Given a single research\-levelmathematical statement\(theorem, lemma, proposition,corollary, claim, definition\-likeimplication, or assertion\), decompose itinto:1\. hypotheses/assumptions2\. conclusionsAlso rewrite the theorem innormalized\_form which mustbe a single string of the form:"If \[H1\] and \[H2\] and \.\.\. then \[C\]\."The idea is that by explicitly stating the assumptions and conclusion, an LLM will be less likely to change the theorem in ways that make it mathematically different when rephrasing and similarly, it will be easier to control how an LLM changes a theorem to make it a hard negative \(we generated hard negatives, as we describe below, but these were not used in development of the final models\)\.
We gave theorem decompositions to`google/gemma\-4\-E4B\-it`and requested rephrasings using the prompt:
You are a mathematical writer helpingto build a training dataset for an \\embedding model\. Your task is toproduce alternative natural language \\phrasings of mathematical statementsthat preserve their meaning exactly\.You are given the original statementtogether with a pre\-computeddecomposition into hypotheses andconclusions\. Use that decompositionas your guide to reassemble thecomponents into a rephrasing witha different surface form\.The rephrasing must be mathematicallyidentical to the original — it must \\make the same claims with the sameassumptions — but it should differin at least one of the following ways:\- Sentence structure \(e\.g\. "If Athen B" → "Whenever A, we have B"\)\- Logical connective style \("Let \.\.\.Then \.\.\." → "Given \.\.\., it follows \.\.\."\)\- Quantifier ordering or grouping\- Passive vs active voice\- Mathematical phrasing conventions\(e\.g\. "there exists" → "we can find"\)\- Naming convention \(e\.g\. "asubgroup H" → "a subgroup, which wecall H"\)\- Consistently swap variables so"a \+ b = c" becomes "s \+ r = p" butbe sure to keep conistency throughthe whole statement\.Overall, the phasing should be asdifferent as possible while remainingcompletely mathematically equivalent\.The rephrasing should read as naturalmathematical prose — the kind you would \\find in a textbook or research paper\.Include LaTeX notation where the \\original does\.Output the rephrased statementdirectly — no preamble, no explanation, \\no markup\. Just the rephrasing itself\."""USER\_TMPL = """\\Mathematical statement:\\"\\"\\"STATEMENT\\"\\"\\"Pre\-computed decomposition:Hypotheses:HYPOTHESESConclusions:CONCLUSIONSRephrasing:"""RETRY\_SUFFIX = \("\\n\\nYour previous responsewas empty\. ""Output only the rephrasedstatement — no preambleor explanation\."\)Finally, we also prompted`Qwen/Qwen3\.5\-9B`to give us three hard negatives for the theorem using the prompt \(note that we found that training with hard negatives degraded performance so they were omitted from the final training routine\):
You are an expert mathematical editor\.Task:Given a string with"input\_statement", "hypotheses","conclusions", "normalized\_form","subject"\), generate exactly 3"hard\_negatives" — statementsthat closely mimic the visual andsyntactic structure of the"input\_statement" but are strictlymathematically false orsemantically different\.Guidelines for Hard Negatives\(This is not an exhaustive listof guidelines\):1\. Subtle Omission: If it wouldcreate a mathematically differentstatement then omit a property thatis a prerequisite for the conclusion,or remove a specific conclusion2\. Commutative Flip: If it wouldcreate a mathematically differentstatement then reverse the order ofa product that is not commutable ina way that is plausible but strictlybreaks the equality\.3\. Negation: If a negation wouldcreate a mathematically differentstatement, then negate an assumptionor conclusion to fundamentally alterthe semantic meaning of the statement\.Validity:\- Every variable used in the conclusionmust be defined in the hypotheses\.Every operator must act on the correcttype of object\.\- The statement must be read as avalid, grammatically correct mathematicalsentence, even though the underlyingclaim is false\.\- It cannot simply be a "weaker" truestatement\. If the original theoremguarantees x \> 0, generating a negativethat says x \\ge 0 is invalid because itis still a true mathematical statement\.Do not merely omit a conclusion orweaken an inequality without introducinga strict contradiction or a falsemathematical guarantee\.Style Constraints:\- Preserve the original LaTeX formattingstyle, notation, and complexity\.\- The visual "weight" of each equationmust match the original theorem exactly\.
### B\.2The role of hard negatives
We generated hard negatives for informal text that were designed to look lexically similar but have a different mathematical meaning\. Often, this modification made the statement false\.
During initial fine\-tuning runs, we attempted to use the hard negatives alongside the rephrasings\. However, we found that with our training techniques, fine\-tuning with the hard negatives degraded performance\. We therefore decided to only use the rephrasings\. We leave hard negatives as a tool to explore in future research\.Similar Articles
Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
This paper benchmarks Google Embeddings 2 against five open-source models for multilingual dense retrieval and RAG, finding GE2 top in accuracy but slower, with mE5-L as a competitive low-latency alternative.
Your Embedding Model is SMARTer Than You Think
SMART is a framework that unlocks latent multi-vector capabilities in single-vector models for multimodal retrieval, improving state-of-the-art performance with reduced computational costs via contrastive training and late-interaction inference.
Embeddings for Preferences, Not Semantics
This paper introduces a new embedding model designed to capture preferential similarity rather than just semantic similarity, improving preference prediction for collective decision-making systems.
Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization
This paper introduces Text2Opt-Bench, a scalable benchmark for text-to-optimization, and identifies that LLMs struggle with 'binding' (grounding problem data) rather than 'modeling' (choosing optimization structure). The authors propose BIND, a simple inference-time method that externalizes numeric data, significantly improving accuracy across models.
TabularMath: Understanding Math Reasoning over Tables with Large Language Models
TabularMath introduces a benchmark and AutoT2T framework for evaluating LLMs' mathematical reasoning over tabular data, revealing that table complexity, data quality, and modality significantly impact model performance. The study addresses a gap in LLM evaluation by systematically assessing robustness to incomplete or inconsistent table information in real-world scenarios.