New secret math benchmark stumps AI models and PhDs alike

Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper. Credit: Epoch AI

To allow correct answers to be verified automatically during testing, FrontierMath problems must have answers that can be checked through computation, either as exact integers or as other mathematical objects. The designers made the problems “guessproof” by requiring large numerical answers or complex mathematical solutions, so that a random guess has less than a 1 percent chance of being correct.
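To make the idea of automatic checking concrete, here is a minimal Python sketch. It is an illustration only and not Epoch AI's actual grading code: the function names `is_correct` and `guess_probability`, and the assumption that guesses are spread uniformly over a fixed number of plausible answers, are hypothetical.

```python
from fractions import Fraction


def is_correct(submitted, reference):
    """Accept a submission only if it exactly equals the reference answer.

    Hypothetical checker: floats are rejected outright so that rounding
    error can never turn a near miss into a match.
    """
    if isinstance(submitted, bool) or not isinstance(submitted, (int, Fraction)):
        return False
    return submitted == reference


def guess_probability(num_plausible_answers):
    """Chance that one uniform random guess hits the single right answer."""
    return Fraction(1, num_plausible_answers)


reference = 10**20 + 7  # made-up large integer answer, for illustration
print(is_correct(10**20 + 7, reference))         # exact integer match
print(is_correct(float(10**20 + 7), reference))  # float submission is rejected
print(guess_probability(10**6) < Fraction(1, 100))
```

Under these assumptions, any answer space with more than 100 equally plausible values already pushes a blind guess below the 1 percent threshold, which is why large numerical answers are hard to guess.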

Mathematician Evan Chen, writing on his blog, explained how he thinks that FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,’” Chen explained.
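For readers unfamiliar with the format Chen mentions, the following Python sketch shows what “implement an algorithm in code” looks like in the Project Euler style. It is modeled on the well-known first Project Euler problem, which is far easier than anything in FrontierMath and is used here only for illustration. The function name `sum_of_multiples_below` is a hypothetical choice.

```python
def sum_of_multiples_below(limit):
    """Sum every natural number below `limit` divisible by 3 or by 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


# The answer is a single integer, so checking a submission is a plain
# equality test rather than a review of a written proof.
answer = sum_of_multiples_below(1000)
print(answer)  # prints 233168
```

The solver's work lies in producing the number, and the grader only compares it with a stored reference value.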

Epoch AI plans to evaluate AI models against the benchmark regularly while expanding its problem set. The organization says it will release additional sample problems in the coming months to help the research community test its systems.

The News Spectrum | © 2024 | News