Humans outperform AI at this highly rigorous mathematics test

Reddit r/singularity 06/14/26, 08:56 PM News

mathematics ai-benchmark ai-evaluation research-level first-proof artificial-intelligence

Summary

The First Proof test evaluated four AI systems on novel research-level math problems, with the top model scoring only 6 out of 10, demonstrating that current AI still lags behind top mathematicians in rigorous reasoning.

Original Article

Cached at: 06/15/26, 12:57 AM

# Humans outperform AI at this highly rigorous mathematics test

Source: [https://www.nature.com/articles/d41586-026-01888-9?error=cookies_not_supported&code=6aecaf13-cfea-4818-93db-a18c0c44892a](https://www.nature.com/articles/d41586-026-01888-9?error=cookies_not_supported&code=6aecaf13-cfea-4818-93db-a18c0c44892a)

![3D rendering of abstract hand-written mathematical formulas and drawings (purple and pink on black background) located in the virtual space.](https://media.nature.com/lw767/magazine-assets/d41586-026-01888-9/d41586-026-01888-9_52539570.jpg)

The top-performing artificial-intelligence model scored 6 out of 10 in the First Proof set of mathematical challenges. Credit: vitacopS/Getty

Artificial intelligence has undergone its most scrupulous maths test yet. The results are in, and the AI models that took part didn’t live up to the problem-solving skills of top mathematicians.

The test (part of a project called First Proof, which aims to evaluate the ability of AI to solve complex questions in mathematics) posed ten research-level maths problems to four AI systems. A jury of anonymous human specialists in the relevant mathematical fields then assessed the models’ answers.

This test was the first of its kind to satisfy three key conditions simultaneously: first, it consisted of research-level maths questions; second, it involved problems that did not appear in the training data; and third, it was formally graded by mathematicians. The results were unveiled on the [First Proof website](https://1stproof.org/second-batch.html#results) on 10 June.

These findings follow recent AI breakthroughs in solving maths problems. Last month, for example, a chatbot made by the technology firm OpenAI, in San Francisco, California, [solved an 80-year-old maths challenge](https://www.nature.com/articles/d41586-026-01651-0) set by the late mathematician Paul Erdős. The First Proof team says that future iterations of the test could help researchers to judge how useful AI models could be for mathematicians; for example, in solving problems autonomously, checking proofs or acting as research assistants.

## Prove this

One important innovation of the First Proof test was that the questions had not previously been mentioned anywhere in the published literature or on the Internet, cutting the risk that the models could simply be regurgitating information they had learnt during their training. Instead, ten researchers from a broad range of mathematical specialities each provided a question that they had solved in the course of their own research but had not yet published.

First Proof ran [a trial test in February](https://1stproof.org/first-batch.html) with a different batch of novel problems. In that round, anyone could try their own favourite AI systems on the problems, and many groups did, but the results were not officially verified by the First Proof team. There was also no way to independently check that the AIs had not received any help from humans.

[![](https://media.nature.com/w400/magazine-assets/d41586-026-01888-9/d41586-026-01888-9_52496550.png) AI cracks 80-year-old mathematics challenge: researchers are astonished](https://www.nature.com/articles/d41586-026-01651-0)

This time, First Proof ran the test itself: the team asked the models to solve problems in an entirely autonomous way, and had a group of 30 mathematicians vet the answers. “The organizers have clearly thought through the second batch more carefully to make it more controlled and systematic,” says mathematician Jeremy Avigad, who heads the Institute for Computer-Aided Reasoning in Mathematics at Carnegie Mellon University in Pittsburgh, Pennsylvania.

Another rule was that the participating models had to be publicly available. This meant that Google’s Aletheia (a system designed specifically for solving maths problems) and the full, unreleased version of Claude Mythos, a model made by Anthropic in San Francisco, California, could not be used. OpenAI was the only big company that participated, with its model ChatGPT 5.5 Pro.

The other systems were provided by three academic groups, from the University of California, Los Angeles (UCLA); Princeton University in New Jersey; and the Swiss Federal Institute of Technology (ETH) in Zurich. All three built ‘harnesses’ on top of existing chatbots, such as ChatGPT, Google’s Gemini and the publicly available version of Anthropic’s Claude. (A harness is an automated system that asks a chatbot a question and has the answer checked by another chatbot, often with repeated back-and-forth.)

## Maths results

The ETH team’s model performed the best, solving six out of ten problems with a system in which ChatGPT’s answers were vetted or improved on by an ‘advisory council’ made up of all three major chatbots. The UCLA team, which built a harness on top of ChatGPT, was the second best, followed by the OpenAI team (ChatGPT with no harness) and Princeton (a harness using mainly Gemini 3.1 Pro as its backend).
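
To make the harness idea concrete, the loop described above (one chatbot proposes an answer, other chatbots check it, and the exchange repeats) might look roughly like the Python sketch below. It is an illustration only, not the code of any First Proof team: the names `run_harness`, `toy_solver`, `strict_reviewer` and `lenient_reviewer` are hypothetical, the majority-vote rule is an assumed stand-in for an "advisory council", and the toy functions replace real chatbot calls so that the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: a solver maps (problem, feedback so far) to a proof attempt;
# a reviewer maps (problem, proof) to (accepted?, critique).
Solver = Callable[[str, List[str]], str]
Reviewer = Callable[[str, str], Tuple[bool, str]]


def run_harness(
    problem: str,
    solver: Solver,
    reviewers: List[Reviewer],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask the solver, let the reviewers vote, and retry with their critiques.

    Returns (accepted proof or None, number of rounds used).
    Acceptance by strict majority is an assumption made for this sketch.
    """
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        proof = solver(problem, feedback)
        verdicts = [review(problem, proof) for review in reviewers]
        approvals = sum(1 for accepted, _ in verdicts if accepted)
        if approvals * 2 > len(reviewers):
            return proof, round_no
        # Pass the critiques of dissenting reviewers back to the solver.
        feedback.extend(critique for accepted, critique in verdicts if not accepted)
    return None, max_rounds


# Toy stand-ins for chatbot calls, so the sketch is self-contained.
def toy_solver(problem: str, feedback: List[str]) -> str:
    return "proof with lemma" if feedback else "proof sketch"


def strict_reviewer(problem: str, proof: str) -> Tuple[bool, str]:
    accepted = "lemma" in proof
    return accepted, ("" if accepted else "missing lemma")


def lenient_reviewer(problem: str, proof: str) -> Tuple[bool, str]:
    return True, ""


if __name__ == "__main__":
    council = [strict_reviewer, strict_reviewer, lenient_reviewer]
    print(run_harness("toy problem", toy_solver, council))
```

In this toy run the first attempt is rejected by two of the three reviewers, their critiques are fed back, and the second attempt is accepted. In a real harness the solver and reviewers would be calls to chatbots, and a reviewer's approval would still be no guarantee that a proof is correct, which is why the First Proof answers were graded by human mathematicians.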

Similar Articles

Our First Proof submissions

AI outperforms mathematicians

[Google DeepMind] the AI co-mathematician also achieves state of the art results on hard problemsolving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.

OpenAI claims it solved an 80-year-old math problem — for real this time

@rohanpaul_ai: Google DeepMind's new paper. Shows that AI can now search formal mathematics proofs, but only inside carefully constrai…
