Omri Weinstein is a theoretical computer scientist, the kind who spends years on a single question and can usually tell at a glance whether a proof in his field is real or wishful. This week he wrote that he had changed his mind about something he had not expected to.
"Even OpenAI's recent Erdős breakthrough didn't convince me that LLMs can do general math research," he posted. "This changed my mind."
What changed it was a short, almost unglamorous pipeline built by his former Columbia collaborator Binghui Peng, together with Runzhou Tao, Steven Wang, and Hantao Yu. Pointed at nine open problems, it solved them. Not exercises dressed up as research, but published open questions, some more than a decade old, taken from the problem lists of the field's main conferences. One of them, Weinstein said, had kept him up at night for two years.
How the pipeline works
Strip away the framing and the method is almost mundane, which is part of what makes it striking. One stage plays the prover: handed an open problem, it writes out a full argument. A second stage plays the verifier, reading that argument the way a referee would, hunting for the step that does not follow, the case that was skipped, the lemma quietly assumed rather than proved, and it hands the objections back. The prover revises. The loop runs again. When the verifier stops finding holes, a human takes over.
The proofs themselves were generated by GPT-5.5 Pro. The writeups were assembled with Claude Code, running Claude Opus 4.8, then read, corrected, and signed off by the authors. For the algebra results the team went further and formalized the proofs in Lean 4, using an automated pipeline of their own, so that a proof assistant — not a language model, and not an overworked referee — certifies that every line holds. That last step carries more weight than it sounds. A formalized proof is not a matter of taste. It compiles or it does not.
The nine problems
The specifics are the point, so here is the full list.
| Problem | Source |
|---|---|
| Shuffled SGD and the SS–RS–GD inequalities | COLT 2021 open problem |
| Learning measured-output quantum circuits (partial) | COLT 2015 open problem |
| Unweighted data selection for linear regression | COLT 2025 open problem |
| Robust conditional probability estimation | COLT 2010 open problem |
| Adversarial robustness of online leverage-score sampling | FOCS 2023 |
| Finite-conductor vs. quasi-coherent rings | Commutative ring theory |
| The Cahen–Fontana–Frisch–Glaz conjecture | Commutative ring theory |
| Integer-valued polynomials over algebras | Commutative ring theory |
| A tiling complement question | Erdős Problem 477 |
A few honest caveats belong beside that table. The quantum-learning result is a partial solution rather than a complete one. These are specialized questions, legible to the few dozen researchers who work on each, not the Riemann Hypothesis. And a person polished every writeup before it was posted. None of that is buried; it is all in the authors' own account. It also does nothing to soften the central fact, which is that a machine produced the mathematical ideas that closed problems careful people had tried and failed to close.
Why a skeptic changed his mind
Even @OpenAI's recent Erdős breakthrough didn't convince me that LLMs can do general math research. This changed my mind. Using a clever 'prover-verifier' LLM loop, this harness solved 9 substantial open problems in Theoretical CS, including one that kept me up at night for 2 years.
Omri Weinstein
Endorsements from people selling the technology are cheap. This was not one of those. Weinstein does not work for the labs, he had already looked at the most-publicized recent result and found it wanting, and he had a personal stake in the outcome. One of the nine was a question from his own corner of the field. The FOCS 2023 problem on the adversarial robustness of online leverage-score sampling sits squarely in the area he works on, the sort of question a specialist recognizes on sight and knows the difficulty of from the inside.
The contrast he drew with OpenAI's Erdős episode is worth understanding. Late in 2025, OpenAI staff said their models had cracked a batch of problems from Paul Erdős's famous list. Much of that turned out to be the model surfacing solutions that already existed in the literature, which a popular tracking site had merely listed as open. Impressive retrieval, but not new mathematics, and the "breakthrough" framing drew sharp public correction from working mathematicians. A careful person could file that under good search. What Weinstein could not file that way was a fresh proof, in his own subfield, of a question he had personally failed to answer.
What is new, and what is not
It is easy to overread a result like this, and easy to underread it. Both are worth resisting.
The overread is that mathematics is now automated. It is not. Humans chose the problems, guided the process, checked the output, and wrote the final papers. The models did not wake up one morning and decide to work on shuffled SGD. The pipeline is a tool, aimed by people who already knew which questions were ripe.
The underread is more tempting for anyone who has watched AI claims inflate and deflate for a decade. It says: neat demo, narrow domain, move on. That misses what actually happened. The ideas at the heart of these proofs, the constructions and case analyses and inequalities, came from the models. Verification by Lean and by human experts confirms the ideas are correct. This is not a chatbot bluffing its way through an oral exam. It is a language model contributing the load-bearing part of a research result, repeatedly, across problems chosen precisely because nobody had solved them.
Why this is on a site about AI risk
The distance between "AI can pass a hard exam" and "AI can produce mathematics that experts could not" is the distance this Foundation has spent every article trying to make people feel. Mathematics and theoretical computer science are not just another domain the technology is coming for. They are the substrate it is built from. Optimization, learning theory, and complexity are the raw materials of the next model.
A system that can meaningfully contribute to open problems in those fields is, in principle, a system that can contribute to the design of its successor. That is the mechanism behind a phrase people reach for and rarely examine: recursive self-improvement. Research that speeds up research is the difference between a steady climb and a sprint, and the authors have already said their plan is to aim the same pipeline at every field of science.
Whether it works everywhere is almost beside the point. The direction is set, and it is arriving faster than forecasts that already felt aggressive when they were published. The timeline keeps getting shorter, and results like this are why. The AI 2027 scenario put automated AI research at the center of its story for exactly this reason: once machines are doing the science, the clock speeds up.
None of this is an argument against solving open problems. It is an argument about pace. The people building these systems keep telling us, in public, that the next capability is close, and then it arrives, and it is close. A pipeline a graduate student can run over a weekend now does work a field reserves its respect for. The question the Foundation keeps asking is the only one that scales with the capability: who decides how fast this goes, and on what terms. For now the answer is nobody, and as fast as possible.
Common questions.
A pipeline built by Binghui Peng, Runzhou Tao, Steven Wang, and Hantao Yu resolved nine open problems: five in theoretical computer science and learning theory, drawn from the open-problem lists of the COLT and FOCS conferences; three in commutative ring theory; and one from Erdős's list of combinatorics problems. These were genuinely unsolved, published questions rather than exercises, though one of the learning-theory results is a partial solution. The proofs were generated by GPT-5.5 Pro, the writeups assembled with Claude Opus 4.8, and the algebra results were machine-checked in the Lean 4 proof assistant.
One model acts as a prover and writes out a proposed proof. A verifier stage then critiques it, looking for gaps, missing cases, and unjustified steps, and hands back its objections. The prover revises, and the cycle repeats until the verifier can no longer find a flaw. Human researchers then review the result, and for the algebra problems the team formalized the proofs in Lean 4, so that a proof assistant certifies each step mechanically rather than relying on a model's judgment.
It is new mathematics. In the OpenAI Erdős episode, much of the reported progress came from a model surfacing solutions that already existed in the literature but had been listed as open, which mathematicians criticized as retrieval rather than discovery. Here the models produced the core arguments for problems that had no known solution, several of the results were verified in Lean, and a domain expert with a personal stake in one of the problems vouched for them publicly.
Because mathematics and theoretical computer science are the foundations that AI itself is built on. A system that can contribute original results in optimization, learning theory, and complexity is a system that can, in principle, help design the next system, which is the core of recursive self-improvement and a reason timelines to advanced AI keep shortening. Capability that compounds this way is exactly what makes governing the pace of AI development urgent.