GPQA benchmark - testing LLMs on graduate-level questions

The GPQA benchmark is a widely cited test of LLM reasoning ability.


It assesses how well LLMs handle complex, domain-specific questions in biology, physics, and chemistry.


Let's dive in.



LLM performance based on GPQA score

Last updated: July 2024

Company      Model               GPQA score
OpenAI       GPT-4o              53.6%
Meta         Llama 3 400B        48.0%
Google       Gemini 1.5 Pro      46.2%
Meta         Llama 3 70B         39.5%
Google       Gemini 1.5 Flash    39.5%
Inflection   Inflection 2.5      38.4%
OpenAI       GPT-4               35.7%

OpenAI's GPT-4o leads at 53.6%, with Meta's Llama 3 400B at 48.0% and Google's Gemini 1.5 Pro at 46.2% close behind.


The race at the top is tight, yet even the strongest models fall well short of the roughly 65% accuracy that human experts achieve on the same questions; OpenAI's earlier GPT-4 trails the field at 35.7%.



What is the GPQA benchmark?

GPQA stands for "graduate-level Google-proof Q&A." It was introduced by Rein et al. (2023).


The benchmark contains 448 multiple-choice questions in biology, physics, and chemistry, written and validated by domain experts.


The test is extremely difficult: domain experts reach roughly 65% accuracy, while skilled non-experts with unrestricted internet access average only 34%.
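
For readers who want to see how a score like the ones above is produced, here is a minimal Python sketch of a GPQA-style multiple-choice evaluation loop. The Hugging Face dataset ID, configuration name, column names, and the ask_model() helper are assumptions made for illustration, not details taken from the benchmark paper.

```python
# Minimal sketch of a GPQA-style multiple-choice evaluation.
# Assumed (not from the article): the "Idavidrein/gpqa" dataset ID and
# "gpqa_main" config on the Hugging Face Hub, its column names, and the
# ask_model() placeholder standing in for a real LLM call.
import random

from datasets import load_dataset


def ask_model(question: str, choices: list[str]) -> int:
    """Hypothetical LLM call: return the index of the chosen answer."""
    return random.randrange(len(choices))  # stand-in; a real model goes here


def main() -> None:
    # The main GPQA split holds 448 expert-written questions in
    # biology, physics, and chemistry (gated; requires accepting the terms).
    dataset = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")

    correct = 0
    for row in dataset:
        # Shuffle the correct answer in with the three expert-written distractors.
        choices = [
            row["Correct Answer"],
            row["Incorrect Answer 1"],
            row["Incorrect Answer 2"],
            row["Incorrect Answer 3"],
        ]
        random.shuffle(choices)
        gold = choices.index(row["Correct Answer"])

        if ask_model(row["Question"], choices) == gold:
            correct += 1

    # The GPQA number reported on leaderboards is plain accuracy.
    print(f"GPQA accuracy: {correct / len(dataset):.1%}")


if __name__ == "__main__":
    main()
```

Swapping ask_model() for a real model call (for example, an API request that returns a letter choice) yields the accuracy figure that leaderboards report.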



Pros & cons with the GPQA benchmark

Pros

  • Evaluates LLMs on advanced, real-world questions

  • Ensures high-quality, expert-validated questions

  • Tests practical knowledge application in specialized domains

Cons

  • Some questions may be overly complex

  • Limited dataset size

  • Requires substantial domain knowledge, limiting broader use

