Leading AI models fail new test of artificial general intelligence



The ARC-AGI-2 benchmark is designed to be a difficult test for AI models

Just_Super/Getty Images

The most sophisticated AI models in existence today have scored poorly on a new benchmark designed to measure their progress towards artificial general intelligence (AGI) – and brute-force computing power won’t be enough to improve their scores, because evaluators now also take into account the cost of running each model.

There are many competing definitions of AGI, but it is generally taken to refer to an AI that can perform any cognitive task that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning abilities called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, leading some to ask if the company was close to achieving AGI.

But now a new test, ARC-AGI-2, has raised the bar. It is difficult enough that no current AI system on the market can score more than single digits out of 100, while every question has been solved by at least two humans in under two attempts.

In a blog post announcing ARC-AGI-2, ARC president Greg Kamradt said the new benchmark was required to test different skills from the previous iteration. “To beat it, you must demonstrate both a high level of adaptability and high efficiency,” he wrote.

The ARC-AGI-2 benchmark differs from other AI benchmarks in that it focuses on models’ ability to complete simple tasks – such as replicating changes in a new image based on past examples of symbolic interpretation – rather than their ability to match world-leading PhD-level performance. Current models are good at the kind of “deep learning” that ARC-AGI-1 measured, but fare worse on the seemingly simpler tasks in ARC-AGI-2, which demand more flexible reasoning and interaction. OpenAI’s o3-low model, for instance, scores 75.7 per cent on ARC-AGI-1, but just 4 per cent on ARC-AGI-2.
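
To make the task format concrete, here is a minimal sketch of how an ARC-style task can be represented and scored, assuming the publicly documented grid-based format (small 2D arrays of colour codes, with “train” example pairs and a “test” input). The mirrored-grid rule and the `solve` and `score` helpers are hypothetical illustrations, not ARC’s actual code.

```python
# A minimal sketch of an ARC-style task, assuming the public grid format.
# A solver must infer the transformation from the "train" pairs and apply
# it to the "test" input; scoring is all-or-nothing on the output grid.
from typing import List

Grid = List[List[int]]  # colour codes 0-9

task = {
    # Hypothetical example: the hidden rule mirrors each grid horizontally.
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 6], [0, 7]], "output": [[6, 5], [7, 0]]},
    ],
    "test": [{"input": [[4, 0], [0, 9]]}],
}

def solve(grid: Grid) -> Grid:
    # Toy solver hard-coded to the mirrored pattern above; a real ARC
    # solver must infer the rule from the training pairs alone.
    return [list(reversed(row)) for row in grid]

def score(predicted: Grid, expected: Grid) -> bool:
    # Exact match is required: every cell must agree.
    return predicted == expected

print(score(solve(task["test"][0]["input"]), [[0, 4], [9, 0]]))  # True
```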

The benchmark also adds a new dimension to measuring an AI’s capabilities, by looking at its efficiency in problem-solving, as measured by the cost required to complete a task. For example, while ARC paid its human testers $17 per task, it estimates that o3-low costs OpenAI $200 in fees for the same work.
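
As a rough illustration of that efficiency dimension, the sketch below turns the dollar figures quoted above into a cost-per-solved-task comparison. The `cost_per_correct_task` helper and the assumption of 100 attempted tasks are hypothetical, and this is not ARC’s official efficiency formula.

```python
# Back-of-the-envelope cost-efficiency comparison using the figures in this
# article: $17 per task for human testers vs an estimated $200 per task for
# o3-low, which solves about 4 per cent of ARC-AGI-2 tasks. Illustrative only.
def cost_per_correct_task(total_cost: float, tasks_attempted: int,
                          accuracy: float) -> float:
    # Dollars spent per task actually solved.
    solved = tasks_attempted * accuracy
    return total_cost / solved if solved else float("inf")

human = cost_per_correct_task(17 * 100, 100, 1.0)     # $17 per solved task
o3_low = cost_per_correct_task(200 * 100, 100, 0.04)  # $5000 per solved task
print(f"human: ${human:.2f} per solved task, o3-low: ${o3_low:.2f}")
```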

“I think the new iteration of ARC-AGI now focusing on balancing performance with efficiency is a big step towards a more realistic evaluation of AI models,” says Joseph Imperial at the University of Bath, UK. “This is a sign that we’re moving from one-dimensional evaluation tests solely focusing on performance [to ones that are] also considering less compute power.”

Any model able to pass ARC-AGI-2 would need to be not just highly competent but also small and lightweight, says Imperial, with the efficiency of the model being a key component of the new benchmark. This could help address concerns that AI models are becoming more energy-intensive – sometimes to the point of wastefulness – to achieve ever-greater results.

However, not everyone is convinced that the new measure is beneficial. “The whole framing of this as it testing intelligence is not the right framing,” says Catherine Flick at the University of Staffordshire, UK. Instead, she says these benchmarks merely assess an AI’s ability to complete a single task or set of tasks well, which is then extrapolated to mean general capabilities across a series of tasks.

Performing well on these benchmarks should not be seen as a major moment towards AGI, says Flick: “You see the media pick up that these models are passing these human-level intelligence tests, where actually they’re not; what they are doing is really just responding to a particular prompt accurately.”

And exactly what happens if or when ARC-AGI-2 is passed is another question – will we need yet another benchmark? “If they were to develop ARC-AGI-3, I’m guessing they would add another axis in the graph denoting [the] minimum number of humans – whether expert or not – it would take to solve the tasks, in addition to performance and efficiency,” says Imperial. In other words, the debate over AGI is unlikely to be settled soon.
