Mathematicians interact with AI, July 2025 update

July 16, 2025


This is a guest post from Aravind Asok. If you have comments, you can contact him at [email protected]. We’ll see if there’s a way to post moderated comments here later.

Recently, several symposia have been organized in which groups of mathematicians interacted with developers of various AI systems (specifically, reasoning models) in a structured way. We have in mind the Frontier Math Symposium hosted by Epoch AI and the DeepMind/IAS workshop. The first of these events received more press coverage than the second, spawning several articles, including pieces in Scientific American and the Financial Times, though both are currently behind a paywall. Curiously absent from these discussions is any considered opinion of mathematicians regarding these interactions, though hyperbolic quotes from these pieces have made the rounds on social media. Neither event was open to the public: participation was limited and by invitation, and in both cases the goal was to foster transparent and unguarded interactions.

For context, note that many mathematicians have spent time interacting with reasoning models (OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, among others). While mathematicians were certainly not exempt from the wave of early prompt-based experimentation with the initial public models of ChatGPT, they have also explored the behavior of reasoning models on professional aspects of mathematics, testing the models on research mathematics, homework problems, example problems for various classes, and mathematics competition problems. Anecdotally, reactions run the gamut from dismissal to surprise. However, a structured group interaction with reasoning models provides a qualitatively different experience than these personal explorations. Since invitation to these events was controlled, their audience was necessarily limited; the Epoch event self-selected for those who expressed specific interest in AI, while the IAS/DeepMind event tried to generate a more random cross-section of mathematicians.

Much press coverage has a breathless feel, e.g., the coverage of comments by Sam Altman in Fortune. It seems fair to say that mathematicians are impressed with the current performance of models and, furthermore, see interesting avenues for augmenting mathematical research with AI tools. However, many mathematicians view the rhetoric that “math can be solved”, extrapolating from progress on competition-style mathematics viewed as a game, as problematic at best, and at worst as reflecting a fundamental misunderstanding of the goals of research mathematics as a whole.

Our discussion here will focus on the Epoch AI-sponsored meeting for concreteness, which was not “secret” in any dramatic or clandestine sense, contrary to some reports. The backstory: Epoch AI has been trying to create benchmarks for the performance of various released LLMs (i.e., chatbots like OpenAI’s ChatGPT, Anthropic’s Claude, Google DeepMind’s Gemini, etc.). Frontier Math is a benchmark designed to evaluate the mathematical capabilities of reasoning models. The benchmark consists of tiered lists of problems. Tier 1 problems amount to “mathematical olympiad” level problems, while Tiers 2 and 3 are “more challenging”, requiring “specialized knowledge at the graduate level.” Frontier Math sought to build a Tier 4 benchmark of “research level” problems.

Building the Tier 4 benchmark necessitated involving research mathematicians. Earlier this year, Epoch reached out to mathematicians through various channels. Initial requests promised some amount of money for delivering a problem of a particular type, but many mathematicians unfamiliar with the source of the communication either dismissed it as not credible or had no interest in the monetary compensation. To speed up the collection of Tier 4 problems, Epoch came up with the idea of hosting a symposium. The symposium was advertised on several social media outlets (e.g., Twitter), and various mathematicians were contacted directly by e-mail. Interested participants were sometimes asked to interview with Frontier Math lead mathematician Elliot Glazer and to produce a prospective problem. Mathematics is a fairly small community, so many of the people who attended already knew others who were attending; the vast majority of attendees also came from California. Participants did sign a non-disclosure agreement, but it was limited to information related to the problems that were to be delivered. Symposium participants also had their travel and lodging covered and were paid a $1,500 stipend for their participation.

Participants were given a list of criteria for problem construction; problems must:

  1. Have a definite, verifiable answer (e.g., a large integer, a symbolic real, or a tuple of such objects) that can be checked computationally.
  2. Resist guesswork: Answers should be “guessproof,” meaning random attempts or trivial brute-force approaches have a negligible chance of success. You should be confident that a person or AI who has found the answer has legitimately reasoned through the underlying mathematics.
  3. Be computationally tractable: The solution of a computationally intensive problem must include scripts demonstrating how to find the answer, starting only from standard knowledge of the field. These scripts must cumulatively run in less than an hour on standard hardware. (A minimal sketch of what such computational checking might look like follows this list.)

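To make the first and third criteria concrete, here is a minimal, hypothetical sketch of what “checked computationally” can amount to: the problem author supplies a script that derives the answer from scratch, and grading reduces to an exact comparison against a submitted value. The toy problem, function names, and grading interface below are invented for illustration; this is not Epoch AI’s actual harness.

```python
# Hypothetical sketch of a Frontier Math-style verification script.
# The "problem" here (sum of Euler's totient up to a bound) is a toy
# stand-in; real Tier 4 problems are designed to be far harder to
# reverse-engineer, but the checking step is just an exact comparison.

def reference_answer(limit: int = 10**5) -> int:
    # Author's solution script: a sieve computing sum_{n <= limit} phi(n).
    # Per the criteria, it must finish in well under an hour on a laptop.
    phi = list(range(limit + 1))
    for p in range(2, limit + 1):
        if phi[p] == p:  # p is prime, so apply the factor (1 - 1/p)
            for k in range(p, limit + 1, p):
                phi[k] -= phi[k] // p
    return sum(phi[1:])

def check(submitted: int) -> bool:
    # Exact match, no partial credit: a large-integer answer of this kind
    # is what makes a problem "guessproof" in the benchmark's sense.
    return submitted == reference_answer()

if __name__ == "__main__":
    ans = reference_answer()
    print(ans, check(ans))
```
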
The participants were divided into groups by field (number theory, analysis, algebraic geometry, topology/geometry, and combinatorics) and told to produce suitable problems.

How did participants contextualize this challenge? In mathematics research one frequently does not know in advance the solution to a given problem, nor whether the problem is computationally tractable. In fact, many mathematicians will agree that knowing a problem is soluble can be game-changing. Moreover, deciding which problems should be deemed worthy of study can be difficult. As a consequence, by and large, participants did not frame the challenge as one of producing research problems, but rather one of simply producing appropriate problems.

Unsurprisingly, the ability to construct such problems varied from subject to subject. For example, one geometer said it was quite difficult to construct “interesting” problems subject to the constraints. There are also real questions about the extent to which “ability to resist guesswork” truly measures “mathematical understanding”. Many participants were rather open about this: even if AI managed to solve the problems they created, they did not feel that would constitute “understanding” in any real sense.

While most participants had written and submitted problems before the symposium started, few had any idea at that point of what would be “easy” or “hard” for a model. Most of the first day was spent seeing how models interacted with these preliminary problems, and the ensuing discussions refined participants’ understanding of the stipulation that problems be resistant to guesswork. Along the way, models did manage to “solve” some of the problems, but that statement deserves qualification and a more detailed understanding of what constitutes a “solution”.

One key feature of the reasoning models was their explicit display of “reasoning traces”, showing the models “thinking”. These traces showed the models searching the web and identifying related papers, but their ability to do so was sensitive to the formulation of the problem in fascinating ways. For example, in algebraic geometry, formulating a problem in terms of commutative ring theory rather than varieties could elicit different responses from a model; yet it is a cornerstone of human algebraic geometry to pass back and forth between the two points of view with relative ease. In geometry/topology, participants noted that models demonstrated no aptitude for geometric reasoning. For example, models could not create simple pictorial models (knot diagrams were specifically mentioned) for problems and manipulate them. In algebraic and enumerative combinatorics, models applied standard methods well (e.g., solving linear recurrences, appealing to binomial identities), but if problems required several steps as well as ingenuity, the models were stymied, even when prompted with relevant literature or correct initial steps.

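For calibration, “solving a linear recurrence” is the kind of routine, closed-form computation that a computer algebra system handles in a few lines. The example below is a small illustrative SymPy sketch (the Fibonacci recurrence, chosen purely for familiarity and not drawn from the benchmark) of the sort of “standard method” the models reportedly applied competently.

```python
# Illustrative only: closed form for the Fibonacci recurrence via SymPy's
# rsolve, a routine "standard method" for linear recurrences.
from sympy import Function, rsolve, symbols

n = symbols("n", integer=True)
y = Function("y")

# Solve y(n+2) = y(n+1) + y(n) with y(0) = 0, y(1) = 1 (Binet's formula).
closed_form = rsolve(y(n + 2) - y(n + 1) - y(n), y(n), {y(0): 0, y(1): 1})
print(closed_form)
```
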
When a model did output a correct answer, examining the reasoning traces sometimes indicated that this happened because the problem was constructed in such a way that the answer could be obtained by solving a much simpler but related problem. In terms of the exam-solution paradigm, we would probably say such a response was “getting the right answer for the wrong reason” and assign a failing grade to such a solution!

Participants were routinely told to aim to craft problems that even putative future reasoning models would find difficult. From that standpoint, it was easy to extrapolate that a future model might behave in a more human way, demonstrate “understanding” in a human sense, and isolate the missing key ingredient. This created a pervasive fear that, if the reasoning traces indicated a model was “close now”, the problem would be solvable by future models. Participants did observe that if the literature in a particular domain was suitably saturated, the models could identify appropriate lemmas and generate relevant mathematics. This was certainly impressive, but one wonders to what extent the natural-language output affects the perceived coherence of responses: it is easy for things to “look about right” if one does not read too closely! Eventually, participants did converge on problems thought to meet the required bar.

The language models that we worked with were definitely good at keyword search, routinely generating useful lists of references. The models also excelled at natural-language text generation and could generate non-trivial code, which made them useful for producing examples. However, press reporting sometimes exaggerated this, suggesting that reasoning models are “faster” or “better” than professional mathematicians. Of course, such statements are very open to interpretation. On the one hand, this could be trivially true; e.g., calculators are routinely faster than professional mathematicians at adding numbers. Less trivially, it could mean automating complicated algebraic computations, but even this would be viewed by most mathematicians as far from the core of mathematical discovery.

The participants at the meeting form a rather thin cross-section of mathematicians, those with some interest in the interface between AI (broadly construed) and mathematics. The symposium Signal chat became very active after the Scientific American article was posted. Undoubtedly, participants felt there were exciting possible uses of AI for the development of mathematics. There are also real questions about whether or when future “reasoning models” will approach “human-level” competence, as well as serious and fascinating philosophical questions about what that even means; this is a direct challenge for the mathematics community. What does it mean to competently do research mathematics? What is valuable or important mathematics?

Finally, there are important practical questions about the impact, e.g., environmental or geopolitical, of computing at this level. All of these questions deserve attention: barring some as-yet-unseen theoretical roadblock, reasoning models seem likely to continue improving, which only underscores their importance. As things stand, however, particularly when it comes to mathematical reasoning, caution seems warranted in extrapolating the future research proficiency of models.


