AI benchmarks are a bad joke – and LLM makers are the ones laughing
AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful.
A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found that only 16 percent of 445 LLM benchmarks for natural language processing and machine learning use rigorous scientific methods to compare model performance.
What's more, about half the benchmarks claim to measure abstract ideas like reasoning or harmlessness without offering a clear definition of those terms or how to measure them.
In a statement, Andrew Bean, lead author of the study, said, "Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."
When OpenAI released GPT-5 earlier this year, the company's pitch rested on a foundation of benchmark scores, such as those from AIME 2025, SWE-bench Verified, Aider Polyglot, MMMU, and HealthBench Hard.
These tests present AI models with a series of questions, and model makers strive to have their bots answer as many of them correctly as possible. The questions or challenges vary depending on the focus of the test. For a math-oriented benchmark like AIME 2025, AI models are asked to answer questions like:
Find the sum of all positive integers n such that n + 2 divides the product 3(n + 3)(n² + 9).
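For the curious, this particular problem can be settled with a few lines of code. The brute-force check below is purely illustrative (it is not how the models are graded), but it shows the kind of exact answer such benchmarks expect: because 3(n + 3)(n² + 9) leaves a remainder of 39 when divided by n + 2, any solution must have n + 2 as a divisor of 39, so a small search covers every candidate.

```python
# Brute-force check of the sample AIME-style problem above (illustrative only).
# Since 3(n + 3)(n^2 + 9) ≡ 39 (mod n + 2), any solution needs (n + 2) to divide 39,
# so checking n up to 40 is enough.
solutions = [n for n in range(1, 40) if 3 * (n + 3) * (n**2 + 9) % (n + 2) == 0]
print(solutions, sum(solutions))  # [1, 11, 37] 49
```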
"[GPT-5] sets a new state of the art across math (94.6 percent on AIME 2025 without tools), real-world coding (74.9 percent on SWE-bench Verified, 88 percent on Aider Polyglot), multimodal understanding (84.2 percent on MMMU), and health (46.2 percent on HealthBench Hard)—and those gains show up in everyday use," OpenAI said at the time. "With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4 percent without tools."
But, as noted in the OII study, "Measuring what Matters: Construct Validity in Large Language Model Benchmarks," 27 percent of the reviewed benchmarks rely on convenience sampling, meaning test items are chosen because they are easy to obtain rather than through methods like random or stratified sampling.
"For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
The OII study authors have created a checklist with eight recommendations to make benchmarks better. These include defining the phenomenon being measured, preparing for contamination, and using statistical methods to compare models. Alongside the OII, the other study authors are affiliated with EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University.
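On the last of those points, a statistical comparison can be as simple as putting a confidence interval around the gap between two models instead of quoting a raw leaderboard delta. The sketch below uses fabricated per-question scores purely to show the mechanics of a paired bootstrap; it is not drawn from the study or from any real benchmark.

```python
import random
from statistics import mean

# Paired bootstrap comparison of two models scored on the same questions.
# The 0/1 per-question results below are fabricated for illustration only.
def bootstrap_diff_ci(scores_a, scores_b, iters=10_000, seed=0):
    """95% bootstrap confidence interval for accuracy(A) - accuracy(B)."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions with replacement
        diffs.append(mean(scores_a[i] for i in idx) - mean(scores_b[i] for i in idx))
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 10   # 70 percent accuracy (made up)
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0] * 10   # 50 percent accuracy (made up)
low, high = bootstrap_diff_ci(model_a, model_b)
print(f"accuracy gap, 95% CI: [{low:.2f}, {high:.2f}]")  # an interval excluding 0 suggests a real gap
```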
Bean et al. are far from the first to question the validity of AI benchmark tests. In February, for example, researchers from the European Commission's Joint Research Center published a paper titled, "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation."
As we noted at the time, the authors of that research identified "a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results."
At least some of those who design benchmark tests are aware of these concerns. On the same day the OII study was announced, Greg Kamradt, president of the Arc Prize Foundation, a non-profit that administers an award program based on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark, announced "ARC Prize Verified, a program to increase the rigor of evaluating frontier systems on the ARC-AGI benchmark."
Verification and testing rigor are necessary, Kamradt observed, because scores reported by model makers or third parties may come from different datasets and prompting methods, making comparisons difficult.
"This causes confusion in the market and ultimately detracts from our goal of measuring frontier AI progress," Kamradt explained.
OpenAI and Microsoft reportedly have their own internal benchmark for determining when AGI – vaguely defined by OpenAI as "AI systems that are generally smarter than humans" – has been achieved. That milestone matters to the two companies because reaching it would release OpenAI from the IP rights and Azure API exclusivity provisions of its agreement with Microsoft.
This AGI benchmark, according to The Information, can be met by OpenAI developing AI systems that generate at least $100 billion in profits. Measuring money turns out to be easier than measuring intelligence. ®