Artificial intelligence ready to score full marks on one of world's most challenging tests

GB NEWS

By Marcus Donaldson


Published: 30/03/2026 - 19:41

'If we truly cared about this as the only thing in life, I think we could get to it pretty quickly,' one expert said

Artificial intelligence is on the brink of scoring full marks on a test designed to measure the divide between machine learning and human intellect, dubbed "Humanity's Last Exam".

Google's Gemini model recorded an impressive 45.9 per cent on the examination last month, marking a staggering leap from the performance of rival systems just two years prior.


When OpenAI's ChatGPT first attempted the test in 2024, it achieved only 3 per cent accuracy, with competitors from Google and Anthropic faring little better.

The rapid advancement has led researchers at Scale, the firm behind the benchmark test, to predict that AI could reach full marks within approximately twelve months.

The examination comprises 2,500 meticulously selected questions spanning roughly one hundred disciplines, from rocket science and mythology to physiology and ancient languages.

Scale and the Centre for AI Safety, a non-profit organisation, developed the test to probe both the breadth of knowledge and depth of reasoning capabilities in AI systems.

To compile the questions, organisers issued a global appeal in September 2024, offering a $500,000 prize fund to experts who could submit challenges that would be difficult to answer through internet searches.

The response was substantial, with specialists from approximately 50 nations contributing some 70,000 potential questions.

Artificial intelligence is ready to score full marks on one of the world's most challenging tests, experts have revealed | GETTY

After eliminating any queries that existing models could solve, the list was reduced to 13,000 before final selection.

Each question demands at least doctoral-level comprehension, meaning anyone approaching a perfect score would qualify as a "universal expert."

Calvin Zhang, Scale's research lead, explained the ambition behind the project: "We wanted to create this close-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on earth can really solve."

He praised the developers working on language models, noting: "We've seen over the past few years insane progress on these language models. It's impressive; model builders have really done a great job at improving these reasoning models."

Google's Gemini model recorded an impressive 45.9 per cent on the examination last month | GETTY

Kate Olszewska, a product manager at Google DeepMind, expressed confidence that the milestone could be reached swiftly if resources were concentrated on the goal.

"If we truly cared about this as the only thing in life, I think we could get to it pretty quickly," she told the Daily Mail.

Anthropic's Claude system has meanwhile achieved 34.2 per cent on the examination, with its scores improving rapidly.

Dr Tung Nguyen, a computer science and engineering professor at Texas A&M University who contributed 73 questions to the examination, offered a more measured assessment of the progress.

'Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,' an expert explained | GETTY

"Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence," he stated.

While acknowledging strong performances from certain models, Dr Nguyen argued that weaker results from others demonstrate that significant chasms persist.

"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human‑level understanding," Dr Nguyen observed, adding: "But HLE reminds us that intelligence isn't just about pattern recognition — it's about depth, context and specialised expertise."

He emphasised that the benchmark's purpose was not simply to defeat AI, but to illuminate where human expertise remains essential.