How Google DeepMind’s latest research agent tackled the FirstProof challenge, and what it means for the future of professional mathematics.
- Autonomous Mastery: Powered by Gemini 3 Deep Think, the AI agent Aletheia successfully solved 6 out of 10 research-level problems in the inaugural FirstProof challenge without human intervention.
- Defining Autonomy: The experiment sets a new benchmark for “autonomous” AI, showing that models can now generate peer-review-quality proofs that adhere to the rigorous scholarship standards of professional mathematicians.
- Rapid Evolution: Aletheia showed significant performance gains over its late-2025 predecessors, demonstrating that improvements in agentic scaffolding and base models are rapidly closing the gap between AI and human expertise.

On February 5, 2026, the mathematics community introduced a gauntlet designed to test the absolute limits of artificial intelligence: FirstProof. Unlike previous benchmarks that relied on textbook problems or competitive math puzzles, FirstProof consists of ten research-level questions that arose naturally in the work of professional mathematicians. These aren’t just “hard” problems; they represent the cutting edge of contemporary thought, requiring a level of rigor and scholarship typically found in peer-reviewed journals.
Google DeepMind’s response to this challenge was Aletheia, a research agent built upon the Gemini 3 Deep Think architecture. The results, recently released following the February 13 deadline, mark a historic milestone. Aletheia autonomously solved 60% of the challenge, providing proofs that, according to a majority of expert assessors, met the high standards of the mathematical literature.

The Question of Autonomy
One of the most significant aspects of the FirstProof trial was the debate over what truly constitutes an “autonomous” solution. The challenge authors (Abouzaid et al.) specified that an AI should not rely on human input for mathematical ideas or for help in isolating the core of a problem. However, the boundary between “guidance” and “peer review” remains a nuanced topic in the AI community.
DeepMind took a conservative and transparent approach to this ambiguity. While human peer review often involves back-and-forth clarification, DeepMind ensured that Aletheia’s solutions were generated without an “expert in the loop.” This distinction is vital; for problems of this caliber, even asking the right technical question requires significant expertise. By removing the human from the iterative process, DeepMind proved that Aletheia could not only solve the math but also navigate the “scholarship” of the field—providing precise citations to peer-reviewed journals and arXiv preprints.

Technical Prowess and Evolution
The performance of the Aletheia agents (A and B) represents a stark leap forward compared to the versions used just months ago for the Erdős problems in late 2025. This success is attributed to a dual-pronged upgrade: more robust agentic scaffolding and the raw reasoning power of the Gemini 3 Deep Think base model.
The experiment utilized a “best-of-2” approach. While both Aletheia A and B encountered individual “false positives,” their combined performance yielded credible solutions to six problems. Interestingly, even the publicly available version of Gemini 3 Deep Think—when sampled by humans—produced a solution to the complex Problem 10 that matched the optimal theoretical complexity bound discovered by the specialized Aletheia agent. This suggests that the underlying reasoning capabilities are becoming more democratized, even as specialized scaffolding pushes the ceiling higher.
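One way to read the “best-of-2” tally is as a simple union: a problem counts as solved if at least one of the two agents produced a solution that assessors judged credible. The sketch below illustrates that reading only. The function name `best_of_n`, the agent labels, and the problem identifiers are hypothetical, and the toy verdicts are made up for illustration; they are not the actual FirstProof results.

```python
from typing import Dict, Set


def best_of_n(credible_by_agent: Dict[str, Set[str]]) -> Set[str]:
    """Return the problems for which at least one agent has a credible solution.

    Assumption: "best-of-n" is read here as a union over agents. Each value in
    `credible_by_agent` holds only the problems whose solution assessors judged
    credible, so an agent's false positives are already excluded.
    """
    solved: Set[str] = set()
    for problems in credible_by_agent.values():
        solved |= problems
    return solved


# Made-up toy verdicts (NOT the real FirstProof outcomes).
toy_verdicts = {
    "agent_A": {"p1", "p2"},
    "agent_B": {"p2", "p3"},
}

combined = best_of_n(toy_verdicts)
print(sorted(combined))
```

Under this reading, one agent's false positive on a problem does not count against the combined tally as long as the other agent's solution to that problem holds up.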

Toward a Collaborative Future
The FirstProof results are not without their complexities. For instance, experts were not unanimous on the validity of the solution for Problem 8, highlighting that even at the highest levels of mathematics, “truth” can be a matter of rigorous debate.
However, the broader perspective remains clear: we are entering an era where AI is no longer just a calculator or a coding assistant, but a legitimate collaborator in the discovery of new mathematical truths. Aletheia’s ability to navigate these problems autonomously suggests that the day when AI contributes as a primary author on groundbreaking mathematical papers is no longer a distant dream, but a rapidly approaching reality.

