
MATH benchmark: Testing the best LLM for math

  • Falk Thomassen
  • Jul 17, 2024
  • 2 min read

Updated: Mar 26

The MATH benchmark is an important LLM evaluation tool.


It tests LLMs on math problems with the goal of determining which LLM is best at math.


Let’s dive in.



Best LLM for the MATH benchmark

Comparing the main frontier models on the MATH benchmark.

Best LLM for math, comparing frontier models

Last updated: March 2025

Company      Model               MATH
xAI          Grok-3              93.3%
Google       Gemini 2.5          92.0%
OpenAI       GPT-o3 mini         87.3%
Anthropic    Claude 3.7 Sonnet   80.0%

There is a sizable spread in MATH benchmark scores among the top-performing LLMs.


xAI’s Grok-3 leads the pack with 93.3%, Gemini 2.5 follows closely at 92.0%, and GPT-o3 mini (the best-performing version of ChatGPT on this benchmark) posts a strong 87.3%.


Claude 3.7 Sonnet comes in last of this group, at 80.0%.


Meta’s Llama 3.1 405B is not included, as no official MATH or AIME scores have been released.
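

To make the comparison easy to reuse, here is a small Python sketch that ranks the scores reported in the table above. The numbers come straight from the table; the ranking code itself is just an illustration.

    # Reported MATH scores (percent), copied from the table above.
    math_scores = {
        "Grok-3 (xAI)": 93.3,
        "Gemini 2.5 (Google)": 92.0,
        "GPT-o3 mini (OpenAI)": 87.3,
        "Claude 3.7 Sonnet (Anthropic)": 80.0,
    }

    # Rank the models from highest to lowest MATH score.
    ranking = sorted(math_scores.items(), key=lambda item: item[1], reverse=True)
    for rank, (model, score) in enumerate(ranking, start=1):
        print(f"{rank}. {model}: {score:.1f}%")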


What is the MATH LLM benchmark?

The MATH LLM benchmark is, for once, not an acronym—it simply stands for math.


It was introduced by Hendrycks et al. (2021) as a way to evaluate how well LLMs perform on challenging math problems.


This benchmark consists of 12,500 problems sourced from high school math competitions and covers topics like:

  • Algebra

  • Geometry

  • Probability

  • Calculus


It’s a tough test:

  • A PhD student without a strong math background scored 40%

  • A three-time IMO gold medalist scored 90% (IMO = International Mathematical Olympiad)


When the dataset was first introduced, even the best LLMs only managed 6.9%. Today, frontier models are close to human expert performance, with some reported results for Claude 3.7 Sonnet reaching nearly 97%.
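

A quick note on how MATH is usually graded: models are prompted to put their final answer inside LaTeX \boxed{...}, and the grader extracts that boxed expression and compares it with the reference answer. The Python sketch below is a minimal, hypothetical grader written for illustration; it is not the official evaluation code, and real harnesses add extra answer normalization.

    def extract_boxed_answer(solution: str) -> str | None:
        """Return the contents of the last \\boxed{...} in a solution string."""
        start = solution.rfind(r"\boxed{")
        if start == -1:
            return None
        # Walk forward from the opening brace, balancing nested braces so that
        # answers like \boxed{\frac{1}{2}} are captured in full.
        i = start + len(r"\boxed{")
        depth = 1
        chars = []
        while i < len(solution):
            ch = solution[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            chars.append(ch)
            i += 1
        return "".join(chars)

    # Hypothetical example: reference answer vs. a model-generated solution.
    reference = r"\frac{1}{2}"
    model_output = r"... so the probability is $\boxed{\frac{1}{2}}$."
    print(extract_boxed_answer(model_output) == reference)  # True -> marked correct

In practice, graders usually also treat mathematically equivalent forms (such as 0.5 and \frac{1}{2}) as matching, which this sketch deliberately skips.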


AIME 2024: A new frontier in math evaluation

Many researchers now use AIME 2024 as a modern benchmark to test LLMs on real competition-level math.


AIME 2024 is based on the American Invitational Mathematics Examination, a competition run by the Mathematical Association of America.


It includes 30 extremely difficult questions, with single-digit human median scores—even among top high school talent.


While LLMs used to struggle here (according to Google DeepMind, GPT-4 scored only 1/30, while Claude and Gemini scored 2/30), the models are now catching up.

This benchmark is quickly becoming the new gold standard for evaluating deep mathematical reasoning.


Other LLM benchmarks

At BRACAI, we keep track of how the main frontier models perform across multiple benchmarks.


If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.
