
MATH benchmark: Testing the best LLM for math


The MATH benchmark is an important LLM evaluation tool.


It tests LLMs on math problems with the goal of determining which LLM is best at math.


Let’s dive in.



Best LLM for the MATH benchmark

The table below compares the main frontier models on the MATH benchmark.

Best LLM for math, comparing frontier models (last updated: December 2024)

Company   | Model          | MATH score
----------|----------------|-----------
Google    | Gemini 2.0     | 89.7%
OpenAI    | GPT-4o         | 76.6%
xAI       | Grok-2         | 76.1%
Meta      | Llama 3.1 405B | 73.8%
Anthropic | Claude 3 Opus  | 60.1%

There are big differences in MATH benchmark scores among the top-performing LLMs.


OpenAI's GPT-4o used to lead with 76.6% (as of July 2024), but it has since been overtaken by Google's Gemini 2.0, which achieved an impressive 89.7%.


Meanwhile, xAI's Grok-2 has caught up, scoring 76.1% and coming close to GPT-4o.



What is the MATH LLM benchmark?

The MATH benchmark is, for once, not an acronym: the name simply refers to math.


It was introduced by Hendrycks et al. (2021) as a way to evaluate how well LLMs perform on challenging math problems.


This benchmark consists of 12,500 problems sourced from high school math competitions and covers topics like the ones below (a short sketch of loading the data follows the list):

  • Algebra

  • Geometry

  • Probability

  • Precalculus
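
To make the dataset concrete, here is a minimal sketch of reading it with Python, assuming the official release has been downloaded and unpacked under a hypothetical local MATH/test directory. In that release, each subject area is a subdirectory, and each problem is a small JSON file with "problem", "level", "type", and "solution" fields.

```python
import json
from pathlib import Path

# Hypothetical local path to the unpacked test split of the MATH dataset.
MATH_DIR = Path("MATH/test")

# Each subject (algebra, geometry, ...) is a subdirectory of JSON files,
# one file per problem.
problems = []
for json_file in sorted(MATH_DIR.glob("*/*.json")):
    with open(json_file, encoding="utf-8") as f:
        problems.append(json.load(f))

example = problems[0]
print(len(problems))             # number of test problems (e.g. 5,000)
print(example["type"])           # subject area, e.g. "Algebra"
print(example["level"])          # difficulty, e.g. "Level 5"
print(example["problem"][:80])   # LaTeX problem statement
print(example["solution"][:80])  # step-by-step solution ending in \boxed{...}
```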


It’s a tough test:

  • A PhD student without a strong math background scored 40%

  • A three-time IMO gold medalist scored 90% (IMO = International Mathematical Olympiad)


When the dataset was first introduced, even the best LLMs only managed 6.9%. Today, frontier models like Gemini 2.0 have come close to human expert performance, reaching nearly 90%.
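
Grading on MATH comes down to exact match on the final answer: reference solutions wrap the answer in \boxed{...}, and a model's output is scored by extracting its own boxed answer and comparing the two. The sketch below is a simplified illustration of that idea rather than the official grading script; real evaluation harnesses normalize LaTeX answers much more aggressively.

```python
def extract_boxed_answer(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX string.

    Braces can nest (e.g. \\boxed{\\frac{1}{2}}), so we walk the string
    and track brace depth instead of relying on a single regex.
    """
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def exact_match(model_output: str, reference_solution: str) -> bool:
    """Simplified MATH-style grading: compare the two boxed answers."""
    pred = extract_boxed_answer(model_output)
    gold = extract_boxed_answer(reference_solution)
    if pred is None or gold is None:
        return False
    # Real harnesses normalize fractions, units, \left/\right, etc.;
    # this only strips spaces.
    return pred.replace(" ", "") == gold.replace(" ", "")


reference = r"Adding the cases gives $\boxed{\frac{1}{2}}$."
model_out = r"So the probability is \boxed{\frac{1}{2}}."
print(exact_match(model_out, reference))  # True
```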



Other LLM benchmarks

At BRACAI, we keep track of how the main frontier models perform across multiple benchmarks.


If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.


