
    AI Benchmarks Are a Bad Joke – and LLM Makers Are the Ones Laughing

    AI companies regularly tout their models’ performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful.

    • Rigorous Science is Rare: A new study reveals that only 16% of 445 AI benchmarks for large language models (LLMs) employ sound scientific methods, undermining claims of model superiority.
    • Abstract Concepts Left Undefined: Nearly half of these benchmarks attempt to measure vague ideas like “reasoning” or “harmlessness” without clear definitions or reliable metrics, leading to misleading results.
    • Broader Implications for AI Progress: From marketing hype to internal AGI milestones, flawed benchmarks distort our understanding of AI advancements, prompting calls for better standards and verification.

    In the fast-paced world of artificial intelligence, benchmarks have become the gold standard for proving a model’s worth. Companies like OpenAI trumpet high scores on tests like AIME 2025 or SWE-bench Verified as evidence of groundbreaking progress. But what if these benchmarks are little more than smoke and mirrors? A recent study from the Oxford Internet Institute (OII) and collaborators at institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University paints a troubling picture. Titled “Measuring what Matters: Construct Validity in Large Language Model Benchmarks,” the research analyzed 445 benchmarks for natural language processing and machine learning tasks. Shockingly, only 16 percent of them use rigorous scientific methods to compare model performance. This isn’t just a minor oversight—it’s a fundamental flaw that calls into question the very foundation of AI hype.

    At the heart of the issue is how these benchmarks are constructed. About half of them claim to evaluate abstract concepts such as reasoning, harmlessness, or even intelligence, but they often fail to define these terms clearly or explain how they’re measured. Lead author Andrew Bean sums it up starkly: “Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.” Imagine trying to judge a chef’s skill based on a taste test where “delicious” isn’t defined; that’s the absurdity we’re dealing with in AI evaluations. This lack of clarity allows companies to cherry-pick results that suit their narratives, turning benchmarks into marketing tools rather than scientific instruments.

    Take OpenAI’s launch of GPT-5 earlier this year as a prime example. The company boasted about setting new records across various benchmarks: 94.6 percent on the math-focused AIME 2025 (without tools), 74.9 percent on SWE-bench Verified for real-world coding, 88 percent on Aider Polyglot, 84.2 percent on MMMU for multimodal understanding, and 46.2 percent on HealthBench Hard. They even highlighted an 88.4 percent score on GPQA for extended reasoning. OpenAI claimed these gains translate to “everyday use,” but the OII study suggests otherwise. Many benchmarks, including AIME 2025, rely on convenience sampling—selecting data for ease rather than representativeness. For instance, AIME problems are designed for calculator-free exams, with numbers chosen for simple arithmetic. Testing AI on these alone doesn’t predict performance on tougher, real-world scenarios like larger numbers, where LLMs often falter. It’s like training a sprinter on a treadmill and declaring them ready for a marathon.
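
    The point about convenience sampling is easy to demonstrate with a toy probe. The sketch below is purely illustrative, not anything from the study or from OpenAI’s evaluations: it generates arithmetic problems with progressively larger operands, so a model tuned to exam-friendly numbers can be tested outside that comfortable range. The ask_model() function is a hypothetical placeholder for whatever LLM call you would use.

    ```python
    import random

    def make_problems(digits, n=20, seed=0):
        """Generate n multiplication problems whose operands all have `digits` digits."""
        rng = random.Random(seed)
        lo, hi = 10 ** (digits - 1), 10 ** digits - 1
        return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]

    def evaluate(ask_model):
        """ask_model is a placeholder: given a prompt string, it should return
        the model's answer as an int (or None if the reply can't be parsed)."""
        for digits in (2, 4, 8):  # exam-friendly numbers first, then well beyond them
            problems = make_problems(digits)
            correct = sum(
                ask_model(f"What is {x} * {y}? Reply with the number only.") == x * y
                for x, y in problems
            )
            print(f"{digits}-digit operands: {correct}/{len(problems)} correct")
    ```

    If accuracy collapses as the operands grow, the headline benchmark score was measuring familiarity with a narrow slice of problems rather than the underlying skill.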

    The problems run deeper. The study found that 27 percent of benchmarks use this convenience sampling approach, which skews results and limits generalizability. Questions in math benchmarks like AIME might ask something as intricate as: “Find the sum of all positive integers n such that n + 2 divides the product 3(n + 3)(n² + 9).” While impressive when solved, these tests don’t necessarily reflect broader capabilities. The researchers aren’t just pointing fingers; they’ve proposed a practical checklist with eight recommendations to fix this mess. These include clearly defining the phenomenon being measured, preparing for data contamination (where models might have “seen” the test data during training), and employing statistical methods for fair model comparisons. It’s a roadmap to making benchmarks trustworthy again.
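
    To make the last of those recommendations concrete, here is a minimal sketch of what a statistically sound comparison could look like: a paired bootstrap over per-question results, rather than a bare comparison of headline percentages. The models and their answer patterns below are invented for illustration; the study calls for statistical rigor but does not prescribe this exact procedure.

    ```python
    import random

    def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=0):
        """Fraction of bootstrap resamples (the same questions for both models,
        drawn with replacement) in which model A's total beats model B's."""
        assert len(scores_a) == len(scores_b)
        rng = random.Random(seed)
        n = len(scores_a)
        wins = 0
        for _ in range(iters):
            idx = [rng.randrange(n) for _ in range(n)]
            if sum(scores_a[i] - scores_b[i] for i in idx) > 0:
                wins += 1
        return wins / iters

    # Hypothetical per-question results (1 = solved, 0 = failed) for two
    # models on the same 30-item benchmark: A scores 70%, B scores 60%.
    model_a = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1] * 3
    model_b = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0] * 3

    print(f"P(A outscores B on a resampled test set): {paired_bootstrap(model_a, model_b):.2f}")
    ```

    On this toy data the estimate lands well short of certainty: a ten-point gap on a 30-question test can still plausibly be noise, which is exactly the nuance a single leaderboard number hides.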

    This isn’t the first time experts have raised alarms. Back in February, researchers from the European Commission’s Joint Research Centre published a paper titled “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation.” They highlighted systemic flaws like misaligned incentives, construct validity issues (ensuring tests measure what they claim to), unknown unknowns, and the gaming of results. It’s a pattern: AI companies optimize for benchmarks, sometimes at the expense of real-world utility, creating a cycle of inflated claims.

    Even benchmark creators are starting to acknowledge the need for change. On the day the OII study dropped, Greg Kamradt, president of the ARC Prize Foundation, announced “ARC Prize Verified,” a program to bring rigor to evaluations on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark. Kamradt noted that varying datasets and prompting methods make reported scores incomparable. “This causes confusion in the market and ultimately detracts from our goal of measuring frontier AI progress,” he explained. It’s a step toward transparency, but it underscores how self-reported scores from model makers can be unreliable without independent verification.

    The stakes are even higher when benchmarks tie into bigger milestones, like artificial general intelligence (AGI). OpenAI and Microsoft reportedly have an internal benchmark for AGI, which OpenAI defines only vaguely as systems “generally smarter than humans.” Reaching it would release OpenAI from the IP-sharing and Azure API exclusivity terms of its deal with Microsoft. According to reports, this benchmark is met once AI systems generate at least $100 billion in profits. It’s a refreshingly straightforward metric: measuring money turns out to be easier than measuring intelligence. But it also highlights the absurdity: while companies chase ethereal concepts like “reasoning” in public benchmarks, their private goals boil down to dollars and cents.

    Ultimately, these flawed benchmarks erode trust in AI. They fuel hype that overshadows genuine innovation and could mislead investors, regulators, and the public. As AI integrates deeper into our lives—from healthcare diagnostics to coding assistants—we need evaluations that are robust, not rigged. The OII study’s call for better practices is a wake-up call. If we want AI to live up to its promise, it’s time to stop laughing along with the LLM makers and demand benchmarks that truly measure what matters. Only then can we separate the breakthroughs from the bad jokes.
