Characterizing initial human-AI proof formalization workflows

arXiv cs.AI 06/04/26, 04:00 AM Papers
Summary
Researchers from Oxford, Cambridge, MIT, CMU and other institutions conduct a mixed-methods study examining how people integrate AI tools into mathematical proof formalization workflows, finding that participants generally achieve higher formalization accuracy with AI assistance while preferring to maintain high-level human control over the proof discovery process.
arXiv:2606.04273v1 Announce Type: new Abstract: For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.
Original Article
View Cached Full Text
Cached at: 06/05/26, 02:06 AM
# Characterizing initial human-AI proof formalization workflows
Source: [https://arxiv.org/html/2606.04273](https://arxiv.org/html/2606.04273)
\\undefine@key

newfloatplacement\\undefine@keynewfloatname\\undefine@keynewfloatfileext\\undefine@keynewfloatwithin

Katherine M\. Collins\*Simon Frieder\*University of OxfordJonas BayerUniversity of CambridgeJacob LoaderUniversity of CambridgeJeck LimCaltechPeiyang SongCaltechCarnegie Mellon UniversityFabian ZaiserMassachusetts Institute of TechnologyLexin ZhouPrinceton UniversityShanda LiCarnegie Mellon UniversitySam LooiUniversity of OxfordJoshua B\. TenenbaumMassachusetts Institute of TechnologyUmang BhattUniversity of CambridgeAdrian WellerUniversity of CambridgeJose Hernandez\-OralloUniversitat Politècnica de ValènciaUniversity of CambridgeCameron E\. Freer†Massachusetts Institute of TechnologyValerie Chen†Carnegie Mellon UniversityIlia Sucholutsky†New York University

###### Abstract

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge\. Advances in AI systems’ ability to generate code and engage in increasingly high\-level mathematical reasoning promise to transform people’s ability to formalize and thereby verify proofs\. While many works focus on benchmarking the current frontier, we instead study how people use these tools\. We conduct a mixed\-methods analysis into the initial impact of AI on people’s formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice\. A qualitative survey shows that people’s preferences are diverse, but with a general desire for AI assistance in formalization that preserves high\-level human control over the proof discovery process\. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains\. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools\. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement\.

Mathematics has long been a bastion of intelligence\. Artificial intelligence \(AI\) is making tremendous strides in mathematical reasoning, attaining a gold medal in the International Math Olympiad\(deepmind2025imo\)and even helping mathematicians crack long\-standing research problems\(davies2021advancing;romera2024mathematical;ellenberg2025generative;liu2025ai;epochai2026frontiermath;hariharan2026milestoneformalizationspherepacking;openai2026unitdistance\)\. But with all of this progress, it remains an open question of how AI may impact—and already is impacting—everyday mathematical practice, both among professionals and the next generation of mathematicians early in learning the trade\. How and where AI will impact mathematical practice is not a function solely of the capabilities of AI systems, but where people actuallywantAI to influence their workflows and how they construct and instantiate new workflows in practice\.

Here, we focus on one particular part of mathematical workflows: formal theorem proving\. For centuries, mathematicians have engaged in the practice of discovering and writing proofs as a way to rigorously argue for or against posited conjectures\. While the practice of theorem proving has been cultivated as a way to get at and affirm “truth,” it has been historically hard to formally verify that proofs are correct\(avigad2010understanding;hales2017formal\)\. This has led some to speculate that many proofs may have small errors \(and even some, consequential errors\(lamport2023errors\)\)\. Languages for formal mathematics \(e\.g\., Lean\(moura2021lean\), Isabelle\(paulson1994isabelle\), Rocq\(hugo\_herbelin\_2026\_19256047\)\) have begun to streamline formal verification, helping not only scale proof verification but even enabling more scalable proof writing across large teams\(CommelinTopaz2024\)\. Yet, writing proofs in such formal languages is non\-trivial and often non\-intuitive\(bayer2024proof\)\. AI for formalization promises to massively scale up formalization of human\-written proofs\(yang2026formal\)\.

Rather than focus on benchmarking improvements in AI capabilities, we focus squarely on the interplay between people and AI systems\. Specifically, we take initial steps to characterize in a more naturalistic setting people’s preferences for AI integration into theorem\-proving workflows and how they actually construct such workflows and allocate agency in practice\. We do so via a mixed methods analysis, coupling a qualitative survey with a more controlled user study\. We conducted a survey of mathematics students and researchers’ preferences and self\-reported usage of their AI integration: where they want AI integrated, what they feel are the blockers to those visions presently, and where they donotwant AI to be integrated\. The survey was run in late 2025 giving us a glimpse – before the surge in agentic tooling – of where people’s preferences lay on a spectrum of being willing to delegate to versus control AI use in their workflows and how they view the state \(at the time of the study\) of AI tools to help fulfill such visions\. To understand how people actually enact such workflows in practice, we designed and ran a controlled user study, in which participants were given math problems \(at various degrees of difficulty, from simple number theory problems, to problems from topology\) and their proofs expressed in informal mathematics and tasked with formalizing the proofs in Lean, either with or without access to AI\-based tools\. We observed that participants generally attained higher formalization accuracy despite limitations of the tools at the time, achieved in part by many participants flexibly designing multi\-tool workflows\. Our work underscores the importance of studying not just capabilities but also people’s preferences for workflows, what they think barriers to those workflows are, and how they actually use tools in practice relative to those perceived and actual barriers\.

![Refer to caption](https://arxiv.org/html/2606.04273v1/x1.png)Figure 1:Survey responses around current and envisioned AI use\.a,Respondents’ self\-reported AI usage and whether or not they contributed to the mathlib library \(one indication of their experience/engagement with the Lean community\);b,Number of survey respondents indicating preference for certain types of human and AI involvement across different aspects of formalization workflows\. Bubble sizes indicate number of respondents specifying they would like that particular kind of involvement and what kind, if any, of AI involvement they want\. The human involvement coding is over the theorem proving and formalization process overall; AI involvement is then coded more granularly over what parts of theorem proving and formalization workflows respondents specifically indicated wanting \(or not wanting\) AI\. Note, survey responses were coded based on what participants said \(they may fall into other categories that they did not report; see Methods\)\.c\-d,Proportion of survey respondents indicating a given factor in their response, for what they \(c\) liked and \(d\) were frustrated by in AI tools \(at the time of the study\)\. Responses are broken down by self\-reported AI usage, and those who did not respond meaningfully to that question were excluded \(of the3131respondents,2424responses were coded as meaningful inccompared to2727ford\)\.## Results

![Refer to caption](https://arxiv.org/html/2606.04273v1/x2.png)Figure 2:Survey responses\.Full set of coded survey responses\. Each column represents one respondent\. A cell is colored if the respondent mentioned that factor in their response\. Labels above the columns indicate the self\-reported AI use of the respondent \(e\.g\., that they “Rarely” used AI in their work\)\. At the bottom, respondents’ self\-reported mathematics experience, formalization experience, and whether or not they have contributed to Mathlib\.##### People expressed heterogeneous preferences over AI use in mathematical workflows, but generally shared a desire to maintain high\-level control\.

Among the3131survey respondents111We make all anonymized survey responses available at the following[webpage](https://collinskatie.github.io/hai_formalization.github.io/)\.who passed our quality control filters, self\-reported AI use varied widely and was minimally influenced by whether the respondent was an active Lean community contributor, as measured by whether they contributed to mathlib \(Figure[1](https://arxiv.org/html/2606.04273#S0.F1)a\) or by how much prior mathematics and formalization experience they had \(see S\.I\.[A2](https://arxiv.org/html/2606.04273#S2)\)\.

Respondents expressed different preferences for where and how they wanted to have AI involved in mathematical practice; however, most respondents preferred workflows where AI did not fully automate mathematics \(see Figure[1](https://arxiv.org/html/2606.04273#S0.F1)b\)\. While many respondents wanted to automate core aspects of the formalization process \(70\.3%\), far fewer mentioned AI use in proof discovery \(40\.7%\) and of those who did mention AI in the formalization process, only 33\.3% indicated a desire for all aspects of the formalization process to be automated by AI\. Rather, a majority of respondents \(81\.7%\) indicated a preference for full or at least partial human control over the process \(14\.8% indicated they would never want to use AI; 66\.7% indicated a preference to maintain high\-level creative and/or strategic control over the process when engaging with AI; see Figure[1](https://arxiv.org/html/2606.04273#S0.F1)b and S\.I\. Table[2](https://arxiv.org/html/2606.04273#S2.T2)\)\. We observe that this trend persisted across participants with different levels of AI usage and prior formalization experience \(see Figure[2](https://arxiv.org/html/2606.04273#Sx1.F2)\)\.

##### People use AI despite raising reliability concerns\.

Our survey characterized both respondents’ visions for future workflows – and how they actually found AI tools, at the time\. While those with moderate to high use of AI generally liked that it could bring efficiency gains and reduce grunt work \(Figure[1](https://arxiv.org/html/2606.04273#S0.F1)b\), the dominant frustration was around lack of reliability \(Figure[1](https://arxiv.org/html/2606.04273#S0.F1)c\)\. Notably, many of the people raising concerns about reliability are the people who are still choosing to use AI \(73\.3% of respondents who indicate they sometimes or always use AI expressed frustrations at the lack of reliability\)\. Additionally, a non\-negligible portion of respondents \(18\.5%\) mentioned formalization\-specific frustrations, e\.g\., over the ability of AI systems they used to match their intended style of formal code as well as general frustrations of lack of integration into their specific formalization workflows \(see Table[2](https://arxiv.org/html/2606.04273#S2.T2)\)\. This suggests a ripe opportunity to build more tailored AI systems for the formalization community and expand evaluation efforts beyond formal correctness\.

![Refer to caption](https://arxiv.org/html/2606.04273v1/x3.png)Figure 3:Aggregate problem solving descriptive analyses, with and without tool access, and formalization workflows\.a,Average accuracy \(both statement and proof formalized correctly, see Appendix\) across problems, for groups with and without tool use;b,Average time taken \(in minutes\) across problems\. Error bars depict standard error;c,Types of tools used by participants and estimated number of unique tools used by each participant\. Each row is one participant\.;d\-fDifferent workflow patterns observed among participants:d,human formalizers with AI assistance;e,AI formalization with human help;f,varied human\-AI synergistic formalization\.Table 1:Participant self\-reported tool usage\. After the user study, participants filled out a questionnaire\. Participants were asked to indicate which, if any, AI\-based tools they used in their formalization and how they used them\. We include full participant responses here\. Participants did not always fully report all tools they actually used in the study \(see Methods and S\.I\.[A4\.4](https://arxiv.org/html/2606.04273#S4.SS4)\)\. Tool names are colored by category, matching Fig\.[3](https://arxiv.org/html/2606.04273#Sx1.F3)d:Inline LLM,Chatbot,Semantic Search,LLM \(Other\)\. Note, in some instances, e\.g\., the participant in the second to last row, they mention two uses of GitHub Copilot the span our colorings\.
##### Human\-AI collaboration tends to increase formalization accuracy, with a mixed effect on time\.

The survey captured respondents’ visions for future human\-AI formalization workflows and current frustrations hindering the implementation of this vision\. But what do people do in practice? Do AI tools for formalization make people more effective at formalization? And how, if at all, are they choosing to use AI to compensate for the kinds of limitations of current tools?

To take initial steps towards these questions, we ran a controlled user study\. Seven participants, who ranged in formalization experience but all had at least a basic understanding of Lean and had taken at least one undergraduate level mathematics course \(see Methods\), were given informal proofs and tasked with formalizing both their statements and the full proof\. They had six problems to formalize and were required to formalize three of the assigned problems on their own \(no tools\) and three with any tools they would like \(see Methods\)\. Participants’ formalizations were generally more accurate when allowed tool use than when not allowed tool use \(Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)a\)\. In a linear mixed\-effects model with crossed random intercepts for participant and problem \(correct ~ tools \+ \(1\|participant\) \+ \(1\|problem\)\), tool access increased the probability of a correct response by0\.300\.30points \(95% CI:\[0\.10,0\.50\]\[0\.10,0\.50\];z=2\.94z=2\.94,p=0\.003p=0\.003;n=42n=42trials from77participants\)\. Random\-intercept standard deviations were0\.200\.20for participant and0\.280\.28for problem, indicating that problem difficulty contributed more variance than participant skill\. Indeed, some problems proved highly challenging even with AI; no participant solved “Problem G” \(see Appendix Figure[12](https://arxiv.org/html/2606.04273#S4.F12)\)\. AI assistance did not have a statistically significant effect on time \(Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)b\) and was highly dependent on the problem \(Appendix Figure[9](https://arxiv.org/html/2606.04273#S3.F9)\)\. A similar linear mixed\-effects model \(time ~ tools \+ \(1\|participant\) \+ \(1\|problem\)\) reveals no significant effect of tool access on time \(6\.616\.6195% CI:\[−43\.62,56\.83\]\[\-43\.62,56\.83\];p=0\.797p=0\.797;n=41n=41222One participant failed to share their video recording for one problem, hence we do not have a timestamp for that video\.trials from77participants\)\.

##### Most participants flexibly constructed their own multi\-tool workflows\.

As in the survey, participants expressed in a post\-questionnaire a range of limitations of the AI tools available at the time of the study \(see S\.I\.[A5](https://arxiv.org/html/2606.04273#S5)\)\. Participants tended to compensate for the current limitations of any one AI tool in at least two ways: using multiple AI tools and exerting prudence over AI output\. Six of the seven participants opted to use more than one AI\-based tool \(Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)c\) and expressed deliberately tailoring their choice of tool to the task at hand \(Table[1](https://arxiv.org/html/2606.04273#Sx1.T1)\)\.

Additionally, we conduct preliminary qualitative exploratory analysis on some of the recorded videos \(see Methods\) to begin to understand the actual behavior participants engage in when interacting with AI\-based tools\. We observe regular engagement by participants in deliberate judgment of whether to accept or reject individual suggestions \(see examples in S\.I\.[A4\.4](https://arxiv.org/html/2606.04273#S4.SS4)\)\. Of the videos inspected, we observe that most participants engage in the kind of behavior respondents of the survey indicated a preference for, i\.e\., humans maintaining strong strategic guidance over the direction of the proof\. Several participants \(N=5N=5\) engaged in most of the formalization themselves, having AI only do light autocomplete or help with lemma retrieval \(we may call this group “human formalizers with AI assistance”; Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)d\)\. Yet, like in the survey, we observe diversity in the actual kind of instantiated workflows\. In contrast, another type \(one participant\) heavily used AI, only occasionally editing the proof manually \(“AI formalization with human help”; Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)e\), which may more directly align with the autonomous AI desired \(and at most, human verify\) subset from the survey\. This participant indicated limited experience in Lean and mathematics and only correctly solved one of the six problems \(in the subset where AI\-use was permitted\)\. Lastly, one participant’s workflow was more nuanced and intricate, involving heavy and varied AI use yet with extensive human engagement \(Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)f\)\. This participant exemplified a more intimate “hybrid human\-AI formalization\.” They utilized many different tools across different tasks \(see first row of Table[1](https://arxiv.org/html/2606.04273#Sx1.T1)\)\. For instance, in formalizing one problem, the participant used one tool \(Lean Copilot\) for suggesting lemmas, GitHub Copilot for filling in boilerplate parts of the proof, and Claude for rewriting part of the code in a different style\. While this participant may have embodied the kind of pattern some respondents indicated \(high\-level control\), the actual process required substantial switching between tools, which may not meet the desire for ease\-of\-use expressed by several respondents\. We provide additional descriptions of example usage patterns from these initial exploratory analyses in S\.I\.[A4\.4](https://arxiv.org/html/2606.04273#S4.SS4)\.

## Discussion

Effective AI use in practice depends not only on model capabilities, but how people choose to use them\. Across both the qualitative survey and user study, we saw a general preference to maintain higher\-level strategic control over the process of formalization\. In the user study, most participants – even when using AI – formalized substantial parts of the proofs themselves, relying on AI for lower\-level mechanical tasks, but also continuing to engage in low\-level formalization themselves\. Despite the limitations of the tools of the time, many participants still found ways to attain strong performance with them\. In particular, we observed that many study participants opted to use a suite of tools, tailored to particular parts of the formalization process\. Likewise, participants liberally and deliberately invoked control over the suggestions provided by in\-line coding tools, rejecting or deleting suggestions deemed inappropriate for the context\.

These observations underscore a core human faculty: the ability to flexibly adapt to new tools\(allen2020rapid\)\. When assessing the impact of AI integration in human workflows – we need to understand not only what the capabilities of tools are, but how people are and will adapt to them\. Both the survey and user study were conducted before the more recent surge of popularity in agentic coding systems\(li2025aiteammates\), like Claude Code\(anthropic2025claudecode\)or Codex\(openai2025codex\)\. That we carried out the study before the rise of agentic coding systems is in some ways an advantage as it allowed us to study in detail how participants connected \(manually\) different tools to form workflows, which increasingly agentic coding systems could autonomously form\. Our study offers a moment\-in\-time snapshot of initial use behavior and preferences that we cannot roll back the clock to recollect\.

Yet, our study and analyses are limited by several factors\. Both our survey and study involved relatively few participants and captured only a small subset of the broad space of possible mathematics students and researchers\. Respondents to the survey were primarily recruited through mailing lists and the social network X, which included a diverse set of users ranging from undergraduates to professional mathematicians\. A more systematic survey across levels of experience \(in proof discovery and formalization\) and the domain of mathematics is warranted\. Additionally, our user study consisted primarily of post\-graduate or postdoctoral level researchers, not expert mathematicians\. We saw, at the time of the study, dramatic differences in people’s willingness to use AI \(in the survey and study\) and the range and type of tools they engaged with if choosing to use AI — how much variability is there across a wider range of the mathematics community, and how do choices influence one another?

Our user study also focuses on the formalization of given informal proofs, not the actual process of proof discovery\. It is also worth noting that our instruction allowing tool use during one of the weeks may have biased participants to use AI when they otherwise may have chosen not to\. It is an open and important question to more systematically understand how AI access impacts a wider gamut of mathematicians’ workflows\(frieder2024data\)and over the kind of longer time horizons real human thought partners engage in\(collins2026meaningful\)\. In addition to valuable new kinds of evaluations designed to probe research\-level mathematics capabilities of AI systems, e\.g\.,\(abouzaid2026proof\), we also need more structured evaluations – a mix of qualitative case studies, of the kind in conducted in other recent works\(feng2026autonomousmathematicsresearch;bubeck2025early\)and more controlled \(and scaled\) user studies of the kind we implement here\. It is also important to move beyond just assessing accuracy\. Many participants were sensitive to style and expressed desires to preserve that aspect, while automating more mechanical grunt work\. This sentiment aligns with several other expert mathematicians who have shared their views on the future of mathematics, at the time of writing\(tao2025machine;avigad2026mathematiciansageai\)\.

Additionally, while we were able to, with consent, collect highly rich process\-level data into participants’ formalization and human\-AI interaction behavior in the user study, careful coding of such data requires substantial analytical overhead, which we have only scratched the surface of \(and are actively exploring in ongoing work\)\. More scalable human\-AI interactive video analyses, e\.g\., drawing on other work in human\-AI interaction, are important next steps\(mozannar2022reading\), not just for mathematics but the wider range of domains where we want to understand peoples’ AI workflows\. Here, we focus on the recordings where participants interact with AI in the tool use week, analyzing the human\-only week videos can also motivate entirely new AI\-assisted workflows and inform work understanding human cognition and what may make formalization challenging\.

What kinds of human\-AI workflows might we expect people to engage with and create as AI tools continue to advance at a rapid rate? In many ways, mathematics is so much more than what is proved or formalized, sitting partway between engineering, science, and artistic activities\(hadamard1954essay;hardy1992mathematician;newellCreative\)\. If a core part of mathematics by people for people is human understanding and the social relations that flourish around mathematics\(thurston1995proof\), then AI tools which can automatically check proofs, reduce grunt work, and open up new kinds of discoveries could lay the seeds to advance human understanding\. But in that quest, it is exciting – and deeply important – to continue to elicit and evaluate, people’s preferences and practice for how to build and adapt new workflows, alone and together, for this evolving era of mathematics\.

## Materials and methods

### Survey

##### Design

We designed a survey to better understand a wider range of people’s current and preferred modes of interacting with AI as part of proof formalization and discovery\. The survey was designed and conducted after the human\-AI formalization user study \(described below\)\. We based our questions off of the post\-survey participants in the user study answered, with modifications for clarity\. We include the full questionnaire in S\.I\.[A6](https://arxiv.org/html/2606.04273#S6)\. The study received prior ethics approval from the University of Cambridge Department of Engineering ethics review\. All respondents provided informed consent before submitting their survey\.

##### Recruitment

Participants were recruited via X, mathematics department mailing lists, and manually writing to several professors who teach classes on Lean\. This survey was active between October 27, 2025, and November 25, 2025\.

##### Quality Control

Due to the wide recruitment, we received responses from several people who indicated no prior experience with, or understanding of, formalization\. As our goal was to study AI use and preference around mathematical formalization, we removed any participant \(99of4040who responded\) who indicated they had no experience at all with formalization or did not even know what formalization is, or did not provide substantial responses for the open\-ended questions\. Some respondents provided textual responses for only a few of the open\-ended questions\. For respondents who provided genuine responses for at least some questions, we exclude them from the analysis \(listed as N/A\) for questions for which they did not respond\.

##### Analysis Methodology

We focus our analyses on five primary open\-ended questions, which captured what respondents liked about current AI \(Q15\), what they were frustrated by \(Q16\), what they most wanted in the next\-generation of AI \(Q17\), and what they envisioned for future workflows \(Q18\-19\)\. Full questions are included in S\.I\.[A6](https://arxiv.org/html/2606.04273#S6)\. Coding of these open\-ended text responses to the survey was conducted across multiple phases\. First, one author read through all responses and came up with an initial coding\. This coding was discussed and iterated on with three other authors \(Claude Code was also consulted\)\. One author then manually coded all responses based on these codes\. Codes were then revised by these four authors with the goal of having a sufficiently rich yet relatively simple set of codes\. Coding was then updated and discussed by said authors\. Four other authors \(not consulted in the initial coding\) then coded responses\. The original four authors then reconciled coding differences through discussion\. The survey responses sometimes required inductive leaps of intent, hence, the extensive multi\-author coding process – which still involves subjective judgment\. We make all survey responses and codings available at the following[webpage](https://collinskatie.github.io/hai_formalization.github.io/)\.

We also analyze and decompose responses based on self\-reported categorical features \(e\.g\., AI use, mathematics experience\) and conduct exploratory analyses into self\-reported tool use \(S\.I\.[A2](https://arxiv.org/html/2606.04273#S2)\)\.

### User Study

##### Problems

We curate a selection of six problems \(identified by the letters*C*,*N*,*A*,*V*,*G*,*T*, which loosely correspond to their mathematical domain, e\.g\.,*N*represents*number theory*; see Appendix[A3\.1](https://arxiv.org/html/2606.04273#S3.SS1)\)\. This relatively low number of problems is necessary to gather sufficient data for each problem and keep formalization time within reasonable bounds\. Hence, we carefully source problems so that they cover a range of domains, mathematical difficulty, and formalization difficulty \(since a problem being easy mathematically does not imply easy formalization\), as well as contamination \(to test whether participants can uncover the publicly known solution\)\. More details on problem selection are included in Appendix[A3\.1](https://arxiv.org/html/2606.04273#S3.SS1)\.

##### Participants

The user study was active between May 26 and Aug 11, 2025\. Eight participants originally completed our study; however, we excluded one participant who failed to record their videos separately\. Participants were recruited to have at least a baseline level of Lean experience, e\.g\., engaged with at least a tutorial like the Natural Numbers Game\(bayer2025studying\)\. All participants had taken at least one undergraduate mathematics level course; most had post\-graduate level mathematics experience \(see Appendix[A3](https://arxiv.org/html/2606.04273#S3)\)\. In their post\-survey, participants indicated their prior level of Lean experience; participants happened to be split roughly evenly along those who categorized themselves as “Beginner” \(N=3N=3\) versus “Advanced” or “Expert” \(N=4N=4\) formalizers\. Participants were given six informal mathematical problems, together with their proofs, to formalize\. Problems were split into two groups of three, matched by formalization difficulty across groups\. Participants were given two weeks to formalize the problems and proofs; participants are instructed to formalize a group of three problems and proofs in week one, and the other group of three in week two\. Participants’ tool access differed across weeks\. In one week, participants were not allowed to access any tool333Participants were still allowed a text editor \(e\.g\., VSCode\) and Lean compiler\.\(the “human alone” condition\)\. In the other week, participants were allowed access to any AI\-based online resource\. This included all standard AI tools such as ChatGPT, Claude, GitHub Copilot as well as mathematics specific tools like LeanSearch for finding mathlib theorems and Lean Copilot for more tailored Lean\-specific suggestions in VSCode\. Participants were randomly assigned whether to use AI\-assistance in week one or two\. We asked that participants record their screens while working and send us the recordings\. The resulting study includes over8080hours of participants’ screen\-recorded formalization processes\. Additional details are in Appendix[A3\.2](https://arxiv.org/html/2606.04273#S3.SS2)\.

##### Analysis Methodology

Our primary measures are formalization accuracy and time\-to\-formalize\. Three experienced Lean users from our author team manually grade the accuracy of participants’ statements and proofs; one grader marked each problem\. Responses are scored with a binary correct or not based on whether both the statement and proof are correct\. This is inherently subjective in our current definition of accuracy; we are actively working on expanding our accuracy evaluation for future work\. We include an initial more granular analysis of correctness in Appendix[A4\.2](https://arxiv.org/html/2606.04273#S4.SS2)\. All our formalizations will be made publicly available upon full publication\. Additionally, time is measured directly from the videos; participants were asked to record their entire problem solving session, which often unfolded over one or more sessions per problem\. Time\-to\-formalize is summed over all sessions per problem\. We also take steps to assess which AI\-based formalization tools people used based on their self\-report and initial exploratory analyses into a sampling of the live recorded videos\. Where participants used an AI tool that they did not self\-report, we did include that in the tools used as part of coding Figure[3](https://arxiv.org/html/2606.04273#Sx1.F3)c\-d\.

## Acknowledgments

We thank Ced Zhang, Albert Jiang, Tim Gowers, Mateja Jamnik, Kaiyu Yang, Phoebe Zeng, Eric Horvitz, and Hussein Mozannar for valuable conversations that informed this work\. We also thank Jeremy Avigad for kindly helping circulate our studies for recruiting participants\. KMC acknowledges support from the NSF SBE SPRF, the Cambridge Trust, and King’s College Cambridge\. AW acknowledges support from a Turing AI Fellowship under grant EP/V025279/1, The Alan Turing Institute, and the Leverhulme Trust via CFI\. JBT acknowledges support from the AFOSR \(FA9550\-22\-1\-0387\), the ONR Science of AI program \(N00014\-23\-1\-2355\), a Schmidt AI2050 Fellowship, and the Siegel Family Quest for Intelligence at MIT\.

## Participant Contributors

We thank all participants of our study\. The following participants opted to be listed as contributors: Xavier Lien, Cayden Codel, Mauricio Barba, and Anand Tadipatri, and Fabian Zaiser; the others elected to remain anonymous\. Fabian joined our author team after submitting his study to help with analyses and the separate qualitative survey\. We also thank all survey respondents\.

## AI Disclaimer

AI coding tools were used to help with data analysis figures\. We also used Claude to help come up with the survey coding, as described in the Methods under Analysis Methodology\. We take responsibility for all results and analyses presented in this work\.

## References

## Appendix

## A1Additional Related Work

##### AI for Formalization

Many recent formalization systems use LLMs to provide varying degrees of automation to users\. Interactive tools range from retrieval\-augmented generation \(RAG\) pipelines such as ReProver\[NEURIPS2023\_44414694\]that select relevant Mathlib premises and suggest tactics to the user, to assistants such as Lean Copilot\[song2024lean\]and LeanProgress\[george2025leanprogress\]that provide tactic suggestion, proof search, premise selection, and proof progress prediction natively in the editor\. A greater degree of automation is provided by systems such as DeepSeek\-Prover\-v2\[ren2025deepseekproverv2advancingformalmathematical\], Kimina Prover\[wang2025kimina\], or Seed\-Prover\[chen2025seedproverdeepbroadreasoning\]that are trained via reinforcement learning that aim to one\-shot formalizations more accurately or make use of a range of tools to improve formalization quality\. Large upstream corpora such as NuminaMath \(approximately 860k competition\-style problems with chain\-of\-thought solutions\) and the Lean\-specific NuminaMath\-LEAN \(approximately 100k human\-annotated Lean 4 statements and proofs\) increasingly train math LLMs and Lean provers, including Kimina\-Prover, so they help contextualize the capabilities available to participants in our study\[li2024numinamath,numinamathlean2025,wang2025kimina\]\. Complementing prover systems, process\-driven autoformalization in Lean 4 introduces the FormL4 benchmark and a process\-supervised verifier that leverages Lean 4 compiler feedback, and Lean Workbook contributes about 57k natural language\-Lean pairs via an iterative translation\-and\-filtering pipeline checked by the Lean compiler\[forml4,leanworkbook\]\.

##### Interactive Theorem Proving User Studies

A body of work from the 1990s and early 2000s studied how mathematicians engage with interactive theorem proving workflows for formalization\[AITKEN1998263,merriam1996evaluating,merriam1996two,kadoda2000cognitive\]\. For instance, in Merriam and Harrison\[merriam1996evaluating\]characterized four key activities that people engage in when formalizing: planning \(designing the high\-level sketch of the proof\), reusing \(leveraging previously written code and writing code strategically with reuse in mind\), reflecting \(evaluating what has been written and further understanding the proof goal\), and articulating \(actually specifying commands to the prover assistant\)\. As we explore in our work, these four strategies can each be modularly engaged with AI\-based tools for formalization \(effectively or not\)\. Additional user studies into tools for formalization have been conducted since, e\.g\., in de Almeida et al\.\[de2025lessons\], the authors analyze a wide range of Rocq \(formerly Coq\) users’ survey responses about their use behavior, finding marked differences in styles and preferences around formalization depending on the level of experience of the user\. Most closely related to our work, Shi et al\.\[shi2025qed\]conduct an observation study formalizing proofs in Lean and Rocq; this work offers a rich, in\-depth look at modern formalization practice\. The authors observe varied tool use, similar to the kind engaged in our study, and also notice that several users care about features beyond just correct formalization – a pattern we noticed in our study as well \(e\.g\., some participants tried to write their proofs to use particular libraries, or commented on how “ugly” they found their code\)\. Most differentially from us, they do not substantively investigate peoples’ interaction with AI systems as part of their formalization, in contrast to the focus of our work\. More recently, several frontier labs have collaborated with mathematicians for deep dive analyses into their experiences working through one or more problems with that lab’s AI systems\[bubeck2025early,zheng2026ai\]\. We see this style of work as complementary to our more structured user study and wider ranging preference elicitation \(e\.g\., including but not limited to earlier career mathematicians\)\.

##### Human\-AI Interaction for Programming

A fairly extensive set of user studies have been, and continue to be, conducted to understand how developers code with various AI copilots\. These studies have focused on two forms of assistance that AI copilots provide: autocomplete suggestions\[vaithilingam2022expectation,peng2023impact,barke2022grounded,prather2023its,mozannar2022reading,vasconcelos2023generation,cui2024productivity,mozannar2024realhumaneval\]and chat dialogue\[ross2023programmer,chopra2023conversational,kazemitabaar2023studying,gu2023analysts,nam2024using,mozannar2024realhumaneval\]\. These studies show how copilots have generally had a positive impact on software development, e\.g\., leading to an increase in perceived productivity\[ziegler2022productivity,10\.1145/3706599\.3706670\]and rate of task completion in controlled studies\[vaithilingam2022expectation,peng2023impact\]compared to developers writing code on their own\. More recently, moving beyond copilots for software development, we have seen the introduction of coding*agents*\(e\.g\., Devin\[devin\], OpenHands\[wang2024openhands\], Claude Code\[claudecode\]\)\. A growing set of work measures the impact of these more autonomous tools on developer workflows\[impact\_software\_development,chen2025code,becker2025measuring,chen2025can\]\. In our study, we also observe how these AI coding assistants are also increasingly integrated into mathematicians’ proof formalization workflows\.

## A2Additional Survey Analyses

Table 2:Example survey responses around envisioned formalization workflows\.Each row is one survey respondent\. Responses were chosen to highlight variation across participants in some of the richer responses\.Table 3:Example survey respondents’ self\-reported tool usage\. Survey respondents were asked whether they used any of the AI tools that participants reported in the user study\. We include several example responses\.![Refer to caption](https://arxiv.org/html/2606.04273v1/x4.png)Figure 4:Self\-reported AI tool use from qualitative survey\.Respondents were prompted with example tools \(taken from what participants in the user study used\) and did not necessarily describe all tools they use in practice;1818specified at least one specific tool\.a,Counts of tools expressed in respondents’ descriptions when asked whether they used any of the same tools as participants in the user study \(and to write in other tools they may use, if any\)\.b,Number of distinct tools mentioned by each of the1818respondents who specifically named at least one explicit AI\-based tool that they use\.We next include additional insights into the user study\. Table[2](https://arxiv.org/html/2606.04273#S2.T2)shows several example survey responses\. Full responses can be found at[webpage](https://collinskatie.github.io/hai_formalization.github.io/)\. While we did not elicit an exhaustive set of kinds of AI tools respondents used, as in the user study, multiple respondents also indicated several different tools \(Table[3](https://arxiv.org/html/2606.04273#S2.T3)\)\. One author \(working with Claude Code\) annotated the tools participants expressed444The codings are fairly coarse, as the diversity of tools continues to grow and encapsulate multiple subvariants\. One participant noted they used “Copilot\-GPT\-Codex\.” We coded this as GitHub Copilot\. The most common tools again were GitHub Copilot and ChatGPT \(Figure[4](https://arxiv.org/html/2606.04273#S2.F4)\); however, we primed respondents with example tools \(see S\.I\.[A6](https://arxiv.org/html/2606.04273#S6)Q14\)\. Even though it was not listed in our seeded example tools in the question, four respondents indicated they used Claude Code\.

Additionally, we include a full breakdown of codings of responses based on respondents’ self\-reported experience and demographic data, i\.e\., their AI usage \(Figure[5](https://arxiv.org/html/2606.04273#S2.F5)\), whether they have contributed to mathlib \(Figure[6](https://arxiv.org/html/2606.04273#S2.F6)\), their mathematics experience \(Figure[7](https://arxiv.org/html/2606.04273#S2.F7)\), and specifically their formalization experience \(Figure[8](https://arxiv.org/html/2606.04273#S2.F8)\)\. For each, we depict codings for what respondents’ expressed liking, being frustrated by, most wanting next for AI tools and what workflows they envisioned\.

![Refer to caption](https://arxiv.org/html/2606.04273v1/figures/decomp_by_ai.png)Figure 5:Survey results, decomposed by self\-reported AI usage\.![Refer to caption](https://arxiv.org/html/2606.04273v1/figures/decomp_by_mathlib.png)Figure 6:Survey results, decomposed by whether respondent indicated they had previously contributed to mathlib\.![Refer to caption](https://arxiv.org/html/2606.04273v1/figures/decomp_by_math.png)Figure 7:Survey results, decomposed by self\-reported respondent general mathematics experience\.![Refer to caption](https://arxiv.org/html/2606.04273v1/figures/decomp_by_form.png)Figure 8:Survey results, decomposed by self\-reported respondent formalization experience\.
## A3Additional Details on Study Design

We next provide additional details on study design, namely problem construction, participant recruitment, and problem grading\.

### A3\.1Additional Details on Problem Construction

Problems were selected and constructed by our author team\. All problems have sufficiently short formal proofs that a team of pilots was able to formalize all six of them in approximately twelve hours\. This team was made up from the authors of this study, and formalizers all had at least a first degree in mathematics, or were at most at the level of a Post\-Doc; all had significant prior experience in formalizing\.

Formalization difficulty comes in two parts: the intrinsic difficulty of formalizing a problem; and whether Mathlib provides pre\-existing definitions and useful lemmas to carry out formalization; or, at its extreme, contains an approximate version of the sought proof\. In the latter case, simply adapting the key parts of the known formal proof to transform it to the sought proof is much easier than formalizing from scratch\. We also account for this possibility and include both problems where a similar formalization exists, as well as problems where a similar formalization does not exist\.

We provide an overview over the problems used in our study, including their statements as they were given to participants\. For information regarding how participants performed on each problem, we refer to Section[A4\.2](https://arxiv.org/html/2606.04273#S4.SS2)\.

- •Problem N:This is a simple number theory problem about solving a congruence equation\. The expected format of the solution is deliberately open\-ended to encourage participants to present their answers in diverse ways\. Problem Statement:*Find all solutions of the congruence12x≡9\(mod15\)12x\\equiv 9\\pmod\{15\}\.*
- •Problem A:This is an analysis problem involving continuous and differentiable functions and the use of the mean value theorem\. The problem is exactly Cauchy’s mean value theorem, although this information was not given to participants\. Cauchy’s mean value theorem has been included in mathlib since 2019\. Problem Statement:*Letf,g:a,b\]→ℝf,g:a,b\]\\rightarrow\\mathbb\{R\}be continuous and differentiable on\(a,b\)\(a,b\)\. Then there existsx∈\(a,b\)x\\in\(a,b\)such that* \(\(f\(b\)−f\(a\)\)g′\(x\)=\(g\(b\)−g\(a\)\)f′\(x\)\.\(\(f\(b\)\-f\(a\)\)g^\{\\prime\}\(x\)=\(g\(b\)\-g\(a\)\)f^\{\\prime\}\(x\)\.
- •Problem C:This is a counting problem about integer sequences defined by a self\-referential rule involving divisibility\. After classifying the solutions, one must apply foundational results about finite cardinalities, which can be cumbersome as it requires checking obvious non\-equalities\. Problem Statement:*We call a sequencea1,a2,…a\_\{1\},a\_\{2\},\\ldotsof non\-negative integers delightful if there exists a positive integerNNsuch that for alln\>Nn\>N,an=0a\_\{n\}=0, and for alli≥1i\\geq 1,aia\_\{i\}counts the number of multiples ofiiina1,a2,…,aNa\_\{1\},a\_\{2\},\\ldots,a\_\{N\}\. How many delightful sequences of non\-negative integers are there?*
- •Problem G:This is a geometry problem, solved primarily using angle chasing\. Part of the solution involves constructing an auxiliary point along an interval that creates a specified angle with another point, which seems “obviously" valid to a human mathematician\. Much of the difficulty in formalizing comes from showing that such a point exists, which requires careful use of the intermediate value theorem, noting that the angle subtended varies continuously as a point moves along a line\. Problem Statement:*LetABCABCbe an isosceles triangle withAB=ACAB=ACand∠BAC=20∘\\angle BAC=20^\{\\circ\}\. LetGGbe on the segmentACACsuch that∠ABG=20∘\\angle ABG=20^\{\\circ\}\. LetHHbe on the segmentABABsuch that∠ACH=30∘\\angle ACH=30^\{\\circ\}\. Find∠AGH\\angle AGH\.*
- •Problem T:This is a topology problem involving Baire spaces, which requires unraveling definitions, and careful understanding of subspace topologies\. Problem Statement:*A topological spaceXXis said to be “Baire" if for any sequenceA1,A2,…A\_\{1\},A\_\{2\},\\ldotsof open dense sets inXX, the intersection⋂nAn\\bigcap\_\{n\}A\_\{n\}is dense\. SupposeXXis a Baire space andYYis an open subspace ofXX\. Prove thatYYis Baire\.*
- •Problem V:This is a simple visual counting problem involving tiling a 2x3 grid with dominoes\. The mathematical proof is simple; the difficulty comes from having to formally state this problem, which is cumbersome to do in Lean\. This problem also requires manipulation of finite cardinalities\. Problem Statement:*Consider a2×32\\times 3domino board\. You have two1×21\\times 2dominos and two1×11\\times 1dominos\. In how many ways can you cover the board?*

### A3\.2Additional Participant Details

We recruited participants through our networks through university mailing lists and personal contacts\. Participants were told the study would take approximately1212hours and were paid at a fixed rate of$240\\mathdollar 240\(for an estimated$20\\mathdollar 20/hr\)\. In case of questions regarding what consists of AI tool use, we worked with the participants to make sure our criteria were uniform across participants\. Participants provided self\-reports at the end of the study, indicating their experience with mathematics and formalization, as well as providing qualitative responses into their AI use \(see Section[A5](https://arxiv.org/html/2606.04273#S5)\. Participants also provided screenshots of any scratchwork used during their formalization process\. The study received prior ethics approval by our university departmental ethics review, and all participants provided informed consent\.

All participants noted that they used Lean at least once a month \(they were given the option to indicate less frequent use\)\. All participants indicated that they had “moderate comfort” with at least11formal proof language \(Four participants were comfortable with one language; two with two languages; and one participant with three languages\)\. Four participants indicated that they “always” used AI\-based tools when formalizing in their day\-to\-day workflows; two indicated they “sometimes” did; and one indicated “rarely” using AI\-based tools\.

### A3\.3Additional Analysis Details

Three experienced Lean users from our author team marked the accuracy of participants’ formalized proofs\.

![Refer to caption](https://arxiv.org/html/2606.04273v1/figures/time_by_problem.png)Figure 9:Time taken per problem\.Time to formalize each problem\. Error bars depict standard error\.We code tool use based on a mixture of participants’ self\-reported tool use \(see Table[1](https://arxiv.org/html/2606.04273#Sx1.T1)\) and notes from watching videos, as participants did not always report all tools they used\. While we have watched several hours of participants’ videos, and notice that most participants are consistent in their tool use across problem sessions, until we have completed a full coding of all8080hours of video, it is therefore possible that we missed some tool use\.

## A4Additional Analyses into Participants’ Formalization

We include additional exploratory analyses into participants’ formalization results\.

### A4\.1Formalization Decomposed by Experience

We also decompose participants’ formalization accuracy based on their self\-reported Lean experience \(Figure[10](https://arxiv.org/html/2606.04273#S4.F10)\) and mathematics experience \(Figure[11](https://arxiv.org/html/2606.04273#S4.F11)\. We caution over interpretation due to small sample sizes\.

![Refer to caption](https://arxiv.org/html/2606.04273v1/x5.png)Figure 10:Formalization performance by Lean experience\.a,Average accuracy based on whether participants self\-reported as being Beginners \(N=3N=3\) or Experts \(N=4N=4\) with Lean;b,Average time \(min\) across problems\. Error bars show standard error\.![Refer to caption](https://arxiv.org/html/2606.04273v1/x6.png)Figure 11:Formalization performance by mathematics experience\.a,Average accuracy based on whether participants self\-reported as having studied mathematics up to or through the postgraduate level \(N=5N=5\) or undergraduate level \(N=2N=2\)\. Accuracy is decomposed into the problems where participants did or did not have access to tools \(three problems per participant per category\)\.;b,Average time \(min\) across problems\. Error bars show standard error\. These analyses are descriptive and exploratory; we urge caution in any extrapolation due to the small sample size\.
### A4\.2Finer\-Grained Accuracy Evaluations

![Refer to caption](https://arxiv.org/html/2606.04273v1/x7.png)Figure 12:Accuracy per problem, for participants with and without tool use\.Accuracy is averaged per problem\. Error bars show standard error\. We caution over\-interpretation due to small sample sizes \(3−43\-4participants\) per bar\.Throughout the paper, we have focused on accuracy as measured by having both the statement and proof correct\. We report the average accuracy per problem in Figure[12](https://arxiv.org/html/2606.04273#S4.F12)\. However, a few participants were correct in their statement formalization but incorrect in the proof \(or vice versa – formalized the statement incorrectly, but proved the statement correctly under that interpretation\), or formalized only part of the proof\. E\.g\., one proof that was marked as incorrect did get some of the way there according to the grader on Problem N:

‘‘Only one direction \(12x≡9\(mod15\)⟹x=5n\+212x\\equiv 9\\pmod\{15\}\\implies x=5n\+2\) is formalized\. That direction is correctly formalized and proved\. ZMOD is used for modulus in the assumption,5n\+25n\+2is explicitly written as the answer\.’’

Whereas another participant \(for Problem G\) made a start at the problem but did not formalize the entire proof correctly according to the grader:

‘‘Formalised a lot, but with some sorrys \(beyond existence of D\)\. Noted that showing the existence of D ‘‘seems hard‘‘ ’’

We are actively expanding further exploratory analysis of a finer\-grained human evaluation of the proofs\. As before, three formal mathematics researchers from our team assessed each proof \(one per proof\)\. Future work can conduct a more expansive human audit of proofs\.

### A4\.3Analyzing Code Structure

We also conduct exploratory analyses into the style of code participants provided, depending on whether they had access to tools or not\. We observe that there may be some effect of AI\-based tool use on overall proof length, with participants potentially producing – on average – longer formal proofs with AI\-assistance \(Figure[13](https://arxiv.org/html/2606.04273#S4.F13)a\)\. However, this effect is variable at a per\-problem level \(Figure[13](https://arxiv.org/html/2606.04273#S4.F13)b\) and moderated by whether participants even finished the proof\.

![Refer to caption](https://arxiv.org/html/2606.04273v1/x8.png)Figure 13:Formalized proof length\.a,Average code length \(not including comments\);b,Average code length \(not including comments\) per problem\.
### A4\.4Analyses into Formalization Processes from Participant Videos

Participants recorded themselves formalizing proofs, producing over8080hours of videos\. Two authors from our team manually watched and annotated a selection of videos to understand when and how participants were using AI\. We are actively extending our analyses for more rigorous coding over our full library of formalization videos\.

Here, we offer a glimpse into the kinds of behavior different participants showed, with brief summaries of a few of the formalization videos\. These behaviors can be characterized in varying reliance strategies in a user’s AI workflow \(a la\[jorgensen2025documenting\]\)\. As described in the main text, one group of participants predominantly formalized the proofs themselves, only lightly engaging AI tools \(often, GitHub Copilot in\-line\)\. For instance, two participants who could be classified as “human formalizers with AI assistance” each sparingly used GitHub Copilot as part of solving Problem V\. One participant listed themselves as an expert formalizer in Lean; the other, as a Beginner Lean user\. Both participants used comments to “prompt” GitHub Copilot and regularly overrode Copilot suggestions\. One participant \(with Lean experience\) used Copilot in more of a coarse\-to\-fine fashion, working on some functions, templating them and using Copilot to fill in details, particularly with substantive and repetitive pattern matching\. Relatedly, the participant with less Lean experience also regularly rejected Copilot suggestions; they can be seen frequently pausing and deliberately reading the Copilot suggestions, before occasionally accepting them – suggestive of the intentional nature of their proof process\. Both participants solved Problem V correctly\.

In contrast, another participant could be categorized as behaving in a way of “AI formalization, with human help\.” This participant started one problem by opening up two chat windows: one ChatGPT and one Claude\. The participant then pasted in the problem PDF into each window, and proceeded to engage in a copy\-output/copy\-error message cycle into the models over the course of an hour\. While the participant shifted to using GitHub Copilot in the latter half of the proof, even there, the participant did not seem to engage in as much critical thinking\. There was a span of five minutes where the participant went back and forth with GitHub Copilot saying “finish this,” having the model generate a series of local code snippets \(the participant queried88times, accepting77of them outright\)\. This led to a substantial amount of the code not being generated by the participant; the participant seemed to recognize that the code did not match their style and noted in a comment at the top near the end of the video \(“i’m sorry this is so ugly”\)\. While this participant formalized the problem correctly, they took about3×3\\timesas long as the average participant for this problem \(“Problem N”\)\.

Lastly, one participant \(who used nine different tools over the course of the study\) was a highly varied user, engaging many different kinds of AI tools for different parts of their formalization process – yet demonstrated regular and substantial agency in how they used the tools \(e\.g\., regularly declined in use of AI\-based tools over the course of the problem\)\. For instance, on one problem \(“Problem N”\), this participant started by pasting the informal statement into Kimina to formalize, pasted into VSCode and manually edited\. They then went to look at the solution, opened GitHub Copilot chat to get a tailored response using a particular tool, querying “Define the set of all numbers of the form x = 5n \+ 2 for some integer n, in Lean, using ZMOD\.” They then further clarified their interest in ZMOD: “I’d like to use the ZMOD notation\. An example is: \[\.\.\.\]” and copied over the solution\. They then asked for “suggest tactics” from Lean Copilot, looked through and chose the second suggestion, and proceeded to again search for tactics \(this time choosing the third suggestion\)\. They then accepted a major block of GitHub\-generated code, before moving to Claude Sonnet to rewrite part of their code \(“I’d like to prove this just using the definition of ZMOD\. Could you rewrite this proof?”\)\. They then proceeded to delete a large block of AI\-generated code in their proof, went back to Kimina and had the model generate the full formal proof, given the informal, and made manual tweaks\. However,3030seconds later, the participant proceeded with a series of major deletions, essentially restarting the entire formalization process and wrote the rest of the proof primarily unassisted \(ending up with a much simpler proof that used theomegatactic\)\. This is a case where the participant heavily engaged AI, but ended up, in their final proof, relying very little on the actual AI\-generated code\.

## A5Additional Participant Self\-Reported Thoughts on AI\-Based Assistance for Lean

Participants were asked the following questions after their study \(in addition to the question of what tools they used, as presented in Table[1](https://arxiv.org/html/2606.04273#Sx1.T1)\)\. We include participants’ responses verbatim\.

### A5\.1“What do you like most about AI\-based tools for formalizing?”

- •“I like that AI tools can help me formalize "obvious" results that I have trouble formalizing on my own\. I believe that AI can also help familiarize me with new topics and good ways to formalize those topics\.”
- •“ The can \(often\) take care of tedious low\-level details\. They also offer better "library search" than Mathlib’s search function\.”
- •“When I’m certain of an approach, I like how AI\-based tools fill in the unimportant/rote details\. For example, if I need to open a file in Python, rather than looking up the syntax, I can just leave a comment that I want to open a file, and Copilot will fill in the code for me\. When it comes to Lean formalization, Copilot copies parts of proofs from elsewhere in the file, fetch names of lemmas from Mathlib, and fill in assumptions for lemmas, which saves me time\. I rely less on its proof completion, especially for nontrivial proofs\. But it works pretty well on straightforward induction proofs or simp\-filled proofs\.”
- •“ Discovering definitions and lemmas in the library is significantly easier thanks to AI\-based search and autoformalization tools\.”
- •“ I also find it helpful to load relevant Mathlib files into the GitHub Copilot chat context to ask questions that would otherwise take me a significant amount of time to find answers to\.”
- •“ I like that it helps to get me started to have a framework of understanding the problem and giving me ideas to approach the formalization, even if it is unable to give me the full thing\. It is also helpful in pointing me to other resources that I can reference”
- •“ leansearch gives me fast lookup with low cognitive overhead\. Chatgpt/Gemini sometimes are very good at pointing out mistakes that I’ve made in formulating the problem\.”
- •“ Using Copilot can save time\.”

### A5\.2“What frustrates you most about the current state of AI\-based tools for formalizing?”

- •“ Sometimes the code is wrong or doesn’t do what I want it to do \(this is sometimes fixed by giving it more prompts\)”
- •“Miscalibrated confidence in the answers\. Even if the AI "doesn’t know" it will output something plausible but wrong, which is more frustrating than not getting an answer at all\. Also AI\-based tools seem to be much worse for combinatorics and geometry than for algebra or analysis\.”
- •“When it’s wrong, it’s really wrong\. Sometimes Copilot suggests a \(broken\) "proof" of a theorem that will in reality turn out to be much longer or more complex\. This requires me to cancel the suggestion and fill in the proof myself \(which is what I would have done anyway, but the extra keystrokes to dismiss it can be distracting\)\.”
- •“I find that there aren’t enough tools specifically tailored to the Lean environment\. There’s been a lot of exciting progress in the AI for mathematics space in the recent years, but not all of that progress has translated into better tooling for the Lean community\. Even when tools exist, they can be difficult to get working, requiring a complicated set\-up involving external dependencies or an API key configuration \(although it’s completely understandable why this is the case\)\. An ideal situation would be to have a VS Code extension that can interact with the current Lean session and assist the user with various formalization tasks\. Another option would be to have AI\-based tools and tactics shipped by default with Mathlib or Lean, but this has the disadvantage of being "opt\-out" rather than "opt\-in"\.”
- •“There’s still quite a bit of hallucination, calling tactics that don’t exist for example\. Even giving it reference to the Mathlib documentation \(by copying and pasting\) didn’t really help”
- •“1\. Mathlib seems to evolve faster than models can keep up\. 2\. Chat models are often very confident about wrong things – even stuff like "here’s a much shorter simpler proof\!" and it’s actually longer and has error\. 3\. Models are expensive to use\. 4\. Stuff like AlphaProof is not available to the public\.”
- •“The practice of using such tools does not match strong results reported in research benchmarks; they are mostly unable to prove even very simple things, including when given access to the Lean goal state via MCP\. Also, most of the tools are challenging to set up for a local development environment\.”

### A5\.3“What do you envision any future workflow pattern for how you may go about discovery new proofs / formalizing them? Where would you want tools that help you? Where would you NOT want AI\-based tools to be used?”

- •“Better code generation, maybe also suggest more advanced techniques \(like using tactics\)”
- •“Reduction of "bullshitting" \(giving plausible but wrong answers\) when they "don’t know"\. Also, expansion of application domains to be more general than algebra and analysis\.”
- •“Automatically fill in lemmas/definitions as I work on a larger theorem\! For example, it would be great if I could leave a lemma with "sorry," and then while I use the lemma in the main theorem, Copilot \(or whatever tool\) could copy my file up to the lemma and work with it in the background until it finds a proof\. Right now, all interactions are human\-driven\. \(Of course, this will introduce more problems \- what if the definition/proof is too verbose, or not quite what I want\. Then I need to manually clean up, which is still faster than writing the proof myself, but can be more frustrating/tiring\. But if it’s tuned right, this would save time\!\)”
- •“Good autoformalization and proof completion models that integrate seamlessly into the Lean editor would be useful to have\.”
- •“A generally Lean\-aware chatbot that can alert the user to Lean formalization conventions \(like writing ‘a < b’ instead of ‘b \> a’\) explain how a mathematical concept is defined in Mathlib, and how it is meant to be used generate outlines for proof formalization suggest a convenient way to define a mathematical concept in Lean \(for example, by combining existing definitions in the library or by modelling it based on an existing definition of a similar concept\) suggest new lemmas to formalize suggest attributes to tag lemmas with \(like\[simp\],\[aesop\]or\[grind\]\) mention relevant lemmas, tactics or domain\-specific proof formalization strategies improve and optimize proofs give high\-level feedback on a Lean document \(like a Mathlib reviewer\) would make formalization significantly easier and more accessible, in my opinion\.”
- •“Being able to generate tactics that actually exist\. Of course ideally if it can do end to end formalization of proofs that would be amazing, but in the meantime being able to more reliably tackle smaller subproblems that contribute towards the larger goal”
- •“I want tools that can simplify my existing proofs\.”
- •“Autoformalization from natural language, and automatic theorem proving\.”

### A5\.4“What would you want to see most in future AI\-based tools for formalizing proofs?”

- •“If I already have an idea of the solution, I would consult a large language model for ideas on how to go about formalizing parts of the problem\. The AI tool could suggest tactics and theorems to use, and maybe give a working formalization\. I didn’t try asking it for suggestions on how to solve a problem \(because we already had solutions\), but I think that would be cool\.”
- •“I mostly want AI to take care of tedious low\-level details of a proof; I’m happy to do the main structure\.”
- •“It would be great if I could quickly outline the shape of the proof, such as ":= by – induction on n, using core lemmas \[X, Y, Z\]" without explaining my thought process too deeply, and having the skeleton filled in\. The closer a tool can allow me to type in only the key pieces, and having the rest of the proof/syntax get filled in, the better\.”
- •“I don’t like AI\-based tools writing too much code for me \- it makes me lose my mental model of how the program functions\. But if an AI\-based tool can fill in an entire proof for me, and the proof is succinct and compiles quickly, then my mental model stays intact\. The problem is that AI\-based proofs are often not succinct or efficient, and so I have to trade\-off writing the proof correctly from the start, or editing a bad proof into a good one\.”
- •“I think a nice workflow for formalization would be one where you could describe a mathematical idea or proof gradually to a chatbot and get suggestions for ways to formalize it in Lean\. In this interaction, the human only needs to have a passing familiarity with the Lean syntax, and a bulk of the work will be done by the machine\. To keep the response time reasonable, ideally the chatbot should not spend an enormous amount of time autonomously formalizing or discovering proofs, but instead should develop these in conversation with the human\.”
- •“I think I would use it quite regularly for generating small tactics that help me do the mundane stuff, but overall I probably would still have to guide the overall structure of the proof and think of next steps”
- •“More real\-time feedback, in an unobstrusive way\. Interactive theorem proving strongly relies on using goal states and other feedback from the theorem prover\. AI\-based workflows should integrate better with this information and use the information rather than trying to one\-shot the entire proof\.”

## A6Survey Questionnaire

We provide the full survey questionnaire circulated to respondents\. The survey unfolded over three pages: a brief introduction, the consent form, and the main questionnaire\.

### Introduction:

AI Tool Use in Formalization ProcessesWe are interested in understanding how people do \(or don’t\) use AI\-based tools as part of their formalization workflows\. Thank you in advance for filling out this survey\!Three randomly\-selected participants will receive a $30 Amazon gift card\!The survey should take no more than 5–10 min\.Responses from the survey will be analyzed by us and other researchers to better understand how people are using AI\-based tools \(or not using them\!\) as part of formalization\. Emails are collected only to reduce spammers and to contact gift card winners\. Emails will be discarded after the survey collection and not shared in any future data analysis\. Only open\-ended commentary responses and experience questions \(de\-linked from emails\) may be shared with our later publication\.There will first be a brief consent form on data sharing and then the survey\!Any questions?kmc61@cam\.ac\.ukandsimon\.frieder@cs\.ox\.ac\.uk∗Indicates required questionQ1\.Email∗

### Consent form:

This survey is for research purposes\.By completing this survey, you are participating in a study being performed by researchers from the University of Cambridge and other universities \(e\.g\., CMU, Caltech, NYU\)\. The purpose of this research is to evaluate the role of AI and non\-AI tool\-based assistance in formalization\.You must be at least 18 years old to participate\. There are neither specific benefits nor anticipated risks associated with participation in this study\. Your participation in this study is completely voluntary and you can withdraw at any time by simply exiting the study\. You may decline to answer any or all of the following questions\. Choosing not to participate or withdrawing will result in no penalty\. Your anonymity is assured; the researchers who have requested your participation will not receive any personal information about you, and any information you provide will not be shared in association with any personally identifying information\.We will analyze and release the free form text responses, as well as the other question data \(emails will not be included\!\) Please do not participate unless you are okay with the responses potentially being shared\.You may print a copy of this consent form for your records\.If you have questions about this research, please contact the researchers by sending an email tokmc61@cam\.ac\.uk\. These researchers will do their best to communicate with you in a timely, professional, and courteous manner\.If you have questions regarding your rights as a participant, or if problems arise which you do not feel you can discuss with the researchers, please contact the University of Cambridge Dept of Engineering Ethics Offices\.Q2\.I am age 18 or older∗\(Yes; No\)Q3\.I have read and understand the information above\.∗\(Yes; No\)Q4\.I want to participate in this research and continue with the survey\.∗\(Yes; No\)

### Main questionnaire:

Onto the survey\!Now here are a few questions \(should not take more than 5–10 min\)\. Thank you for your time\!Q5\.How would you classify the amount of formalization experience you have \(in any formalization language, e\.g\., Lean, Rocq, Isabelle\)?∗\(Early beginner; Beginner; Moderate; Advanced; Expert\)Q6\.How would you classify your level of \(specifically\) Lean experience?∗\(Early beginner; Beginner; Moderate; Advanced; Expert\)Q7\.How often do you use \*any\* formal proof language \(Lean, Isabelle, Rocq, and/or others\)?∗\(I’ve never used a proof formalization language; I’ve only used a proof formalization language once; Once a month; A few times a month; Once a week; A few times a week; Every day; Multiple times a day\)Q8\.Provide a short textual description for the projects in the previous two questions\. If they coincide, please state so\.∗\(Open\-ended text response\)Q9\.How often do you use \(specifically\) Lean?∗\(I’ve never used Lean; I’ve only used Lean once; Once a month; A few times a month; Once a week; A few times a week; Every day; Multiple times a day\)Q10\.How many proof formalization languages would you say you have moderate comfort with?∗\(0 \(None\); 1; 2; 3; 4; 5; 6; 7;\>\>7\)Q11\.What is the highest level of mathematics you studied attained?∗\(High school level maths; Undergraduate level maths \(non\-major, but took courses\); Undergraduate level maths \(major\); Postgraduate \(Masters\); Postgraduate \(PhD\); Professional/working mathematician; Other\)Q12\.Have you ever contributed to the Lean mathematics library Mathlib?∗\(Yes; No\)Q13\.How often do you find yourself using AI\-based tools when working on formalization?∗\(I always use AI\-based tools when formalizing; I sometimes use AI\-based tools when formalizing; I rarely use AI\-based tools when formalizing; I never use AI\-based tools when formalizing\)Q14\.If you use AI\-based tools \(e\.g\., GitHub Copilot, ChatGPT, Claude, Lean Copilot, LeanSearch, LeanAide, Moogle, Kimina Prover\) in your formalization process, which tools do you use and in what way do you use each tool? Please write at least one bullet for each tool and method of use\.If you never use AI\-based tools when formalizing, please write why not \(e\.g\. unaware of them\)\.∗\(Open\-ended text response\)Q15\.What do you like most about AI\-based tools for formalizing \(answer with respect to one, or a few tools, that you mentioned above\)∗\(Open\-ended text response\)Q16\.What frustrates you most about the current state of AI\-based tools for formalizing?∗\(Open\-ended text response\)Q17\.What would you want to see most in future AI\-based tools for formalizing proofs?∗\(Open\-ended text response\)Q18\.What do you envision any future workflow pattern for how you may go about discovery new proofs / formalizing them?∗\(Open\-ended text response\)Q19\.Where would you want tools that help you? Where would you NOT want AI\-based tools to be used?∗\(Open\-ended text response\)Q20\.Anything else you’d like to add?\(Open\-ended text response\)

## A7Participant Instructions

Participants in our controlled user study were given the following instructions and told which order to do their TOOLS or NO\-TOOLS weeks, as outlined above\. The instructions included links to the consent form and post\-survey\. Participants were sent instructions via email\. Following a few questions from participants, we sent a clarifying email on tools, as outlined below\.

### Instructions

1\.Please formalize the following three problems over the course of this week\. This includes both the formalization of the problem statement and the solution\.2\.When formalizing the problem statement, aim to stay as close as possible to the original statement\.3\.For the solution, you may either follow the provided approach or develop your own, as long as it leads to a correct and complete formal proof\. We have formalized the solutions ourselves, so you can rest assured that the problems are formalizable without undue effort\.4\.Don’t worry if you cannot finish any problem\. You can work on the problems in any order, including starting one problem, stopping, and starting another before coming back to a problem\. We care about your process\! We what do ask you, is to make sure that you submit*a single screen recording per problem*and not mix work on multiple problems within a single screen recording\.5\.If this is assigned for yourNO\-TOOLS week, please DO NOT use any external tools for assistance in formalizing the proof\. If you use an IDE, such as Visual Studio Code, make sure to also turn off any GitHub Copilot or related tools within that environment\.6\.If this is yourTOOLS\-ALLOWED week, you are allowed to use any AI\-based tool\(s\) for assistance \(this includes LLMs, tools specific to the Lean ecosystem, as well as any other tools you might find useful\)\.7\.We kindly ask that whenever you are formalizing the problems, at any time, please record your ENTIRE screen \(e\.g\., using QuickTime video, vokoscreenNG or any other tool you might find useful\)\. We ask that you record*all*interactions related to your problem solving; this include web browsing \(if you use Google search, or LLMs\), as well as any other software that assists you in formalizing\. If you ever forget to record your screen, please note this in the material that you submit back as part of the study\. While it is possible that you may forget, if you do this more than twice, you may not be paid\. Do not worry if you accidentally screen\-record personal matters \- there is an option to black these out at the end, to exclude any frames that reveal personal information\.

#### Submission Details

At the end of the study, pleasesubmit a zipwith the following information:•Your formalized proofs for each problem – in a file labeled “\{your\-last\-name\}\{your\-first\-name\}\{problem\-id\}\.txt”•Any screen recording for that problem labeled “\{your\-last\-name\}\{your\-first\-name\}\{problem\-id\}\-\{record\-idx\}\.mp4” for each record\-idx \(between 0 to however many clips you took for that problem\)•A[Google form](https://forms.gle/KBX6Ec3TS5LpfBGfA)with your post\-survey responses, including a general questionnaire about your prior mathematics and Lean experience\.•If you use scratch paper at any time during the study, we ask that you screenshot or upload the scratchpaper with the label “‘\{your\-last\-name\}\{your\-first\-name\}\{problem\-id\}\-\{page\-idx\}\.png/pdf”Any / all of the information may be released as part of the data collected in this work\. However, we will scrub any names so the data is anonymized\. We will then stitch these videos together\. We will make the screen recordings public \(and anonymized\) as part of the study contributions; however, you will be given a chance to “mask” any part of the recording before publication of your accidentally screen\-recorded material that should not be made public\.By participating, you agree to having data shared and affirming that you are at least 18 years old\. We will communicate payment details after the study\.Please make sure you have filled out the[consent form](https://forms.gle/MQ3VpZjNLatkmRc68)\(also emailed\) before you begin\.You only need to fill out the consent form once\.

### Clarification email excerpts

We wanted to provide some clarification on what counts as a “tool“ for the study, following a great question by a participant\. In our instructions we said: “If this is assigned for your NO\-TOOLS week, please DO NOT use any external tools for assistance in formalizing the proof\.” but we realized it might not be clear to everyone where exactly to draw the line what a tool is and what not \- we received some questions from some of you about this, so we wanted to write an email to everyone to make sure we are all aligned: If you use Google search, or any other “tool” that is neither specifically designed to aid math and formalization, this counts as NO TOOL, the only exception being LLMs, which do count as TOOL \(not all LLMs were are designed to aid math and formalization, but still provide significant support, which is why we categorized them as tools\)\. We don’t want write a list of potential tools here, since we do not want to bias you\. But we want to encourage you nonetheless to ask us beforehand, so if you are unsure, whether something counts as tool or not \- PLEASE ASK US FIRST\.

You are not expected to spend more than 12 hrs on the problems over the two weeks\. We recognize some problems are more difficult than others, and may take longer\. You may not be able to finish all problems in 12 hrs\! We encourage you to make sure you at least attempt all problems \(rather than spend all your time on a subset and miss out on trying one or more problems\)\. Of course if you want to spend more than 12 hours, awesome\!, but that is not expected at all\.
Characterizing initial human-AI proof formalization workflows

Similar Articles

@rohanpaul_ai: Google DeepMind's new paper. Shows that AI can now search formal mathematics proofs, but only inside carefully constrai…

@paulg: Interesting. AI will in effect increase both supply and demand for formal methods. You need them more, but you also hav…

Open ai

@geoffreyirving: New paper with Gopal Sarma, Rachel Steratore, and Sunny Bhatt, and me surveying formal methods folk about importance an…

Our First Proof submissions

Submit Feedback

Similar Articles

@rohanpaul_ai: Google DeepMind's new paper. Shows that AI can now search formal mathematics proofs, but only inside carefully constrai…
@paulg: Interesting. AI will in effect increase both supply and demand for formal methods. You need them more, but you also hav…
@geoffreyirving: New paper with Gopal Sarma, Rachel Steratore, and Sunny Bhatt, and me surveying formal methods folk about importance an…