
MATH benchmark: Testing the best LLM for math

  • Falk Thomassen
  • Jul 17, 2024
  • 2 min read

Updated: Mar 26

The MATH benchmark is an important LLM evaluation tool.


It tests LLMs on math problems with the goal of determining which LLM is best at math.


Let’s dive in.



Best LLM for the MATH benchmark

Comparing the main frontier models on the MATH benchmark.

Best LLM for math, comparing frontier models

Last updated: March 2025

Company      Model               MATH
xAI          Grok-3              93.3%
Google       Gemini 2.5          92.0%
OpenAI       GPT-o3 mini         87.3%
Anthropic    Claude 3.7 Sonnet   80.0%

There is a sizable spread in MATH benchmark scores among the top-performing LLMs.


xAI’s Grok-3 leads the pack with 93.3%, Gemini 2.5 follows closely at 92.0%, and GPT-o3 mini (the best-performing version of ChatGPT on this benchmark) posts a strong 87.3%.


Claude 3.7 Sonnet comes in last of this group, at 80.0%.


Meta’s Llama 3.1 405B is not included, as no official MATH or AIME scores have been released.
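

To make the comparison easy to reuse, here is a small Python sketch that ranks the scores reported in the table above. The numbers come straight from the table; the ranking code itself is just an illustration.

    # Reported MATH scores (percent), copied from the table above.
    math_scores = {
        "Grok-3 (xAI)": 93.3,
        "Gemini 2.5 (Google)": 92.0,
        "GPT-o3 mini (OpenAI)": 87.3,
        "Claude 3.7 Sonnet (Anthropic)": 80.0,
    }

    # Rank the models from highest to lowest MATH score.
    ranking = sorted(math_scores.items(), key=lambda item: item[1], reverse=True)
    for rank, (model, score) in enumerate(ranking, start=1):
        print(f"{rank}. {model}: {score:.1f}%")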


What is the MATH LLM benchmark?

The MATH LLM benchmark is, for once, not an acronym—it simply stands for math.


It was introduced by Hendrycks et al. (2021) as a way to evaluate how well LLMs perform on challenging math problems.


This benchmark consists of 12,500 problems sourced from high school math competitions and covers topics like:

  • Algebra

  • Geometry

  • Probability

  • Calculus


It’s a tough test:

  • A PhD student without a strong math background scored 40%

  • A three-time IMO gold medalist scored 90% (IMO = International Mathematical Olympiad)


When the dataset was first introduced, even the best LLMs only managed 6.9%. Today, frontier models are close to human expert performance, with some reported results for Claude 3.7 Sonnet reaching nearly 97%.
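

A quick note on how MATH is usually graded: models are prompted to put their final answer inside LaTeX \boxed{...}, and the grader extracts that boxed expression and compares it with the reference answer. The Python sketch below is a minimal, hypothetical grader written for illustration; it is not the official evaluation code, and real harnesses add extra answer normalization.

    def extract_boxed_answer(solution: str) -> str | None:
        """Return the contents of the last \\boxed{...} in a solution string."""
        start = solution.rfind(r"\boxed{")
        if start == -1:
            return None
        # Walk forward from the opening brace, balancing nested braces so that
        # answers like \boxed{\frac{1}{2}} are captured in full.
        i = start + len(r"\boxed{")
        depth = 1
        chars = []
        while i < len(solution):
            ch = solution[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            chars.append(ch)
            i += 1
        return "".join(chars)

    # Hypothetical example: reference answer vs. a model-generated solution.
    reference = r"\frac{1}{2}"
    model_output = r"... so the probability is $\boxed{\frac{1}{2}}$."
    print(extract_boxed_answer(model_output) == reference)  # True -> marked correct

In practice, graders usually also treat mathematically equivalent forms (such as 0.5 and \frac{1}{2}) as matching, which this sketch deliberately skips.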


AIME 2024: A new frontier in math evaluation

Many researchers now use AIME 2024 as a modern benchmark to test LLMs on real competition-level math.


AIME 2024 is based on the American Invitational Mathematics Examination, a competition run by the Mathematical Association of America.


It includes 30 extremely difficult questions, with single-digit human median scores—even among top high school talent.


While LLMs used to struggle here (according to Google DeepMind, GPT-4 scored only 1/30, while Claude and Gemini scored 2/30), the models are now catching up.

This benchmark is quickly becoming the new gold standard for evaluating deep mathematical reasoning.


Other LLM benchmarks

At BRACAI, we keep track of how the main frontier models perform across multiple benchmarks.


If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.
