LLM benchmark comparison

A comparison of frontier AI models across the main LLM evaluation benchmarks:

MMLU

Testing general knowledge across 57 subjects with multiple-choice questions

GPQA

Testing graduate-level expertise in biology, physics, and chemistry

Chatbot Arena

Rankings based on pairwise human preference votes
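To make the idea of vote-based rankings concrete, here is a minimal sketch of an Elo-style update, one common way to turn head-to-head votes into a ranking. It is an illustration only and not necessarily the method the leaderboard itself uses; the model names, the starting rating of 1000, and the step size `k=32` are all assumptions chosen for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, start=1000.0):
    """Apply one vote: move the winner up and the loser down by the same amount."""
    r_win = ratings.get(winner, start)    # unseen models begin at the assumed start rating
    r_lose = ratings.get(loser, start)
    surprise = 1.0 - expected_score(r_win, r_lose)  # larger when the win was unexpected
    ratings[winner] = r_win + k * surprise
    ratings[loser] = r_lose - k * surprise
    return ratings


# Hypothetical votes, each recorded as (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest rating first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each update is zero-sum, the average rating stays at the starting value, and only the relative order carries meaning.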

MATH

Testing mathematical ability on competition-style problems

HumanEval

Testing coding proficiency on Python programming problems checked by unit tests
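HumanEval results are commonly reported as pass@k: the chance that at least one of k sampled solutions passes the unit tests. The sketch below shows the standard unbiased estimator, assuming `n` samples were generated for a problem and `c` of them passed; the function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples, of which c passed the tests."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with assumed numbers: 10 samples for one problem, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

A benchmark-level score would then be the mean of this value over all problems.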

MMMU

Testing college-level reasoning over images and text
