The GPQA benchmark is an important LLM evaluation tool.
It assesses how well LLMs handle complex, domain-specific questions in biology, physics, and chemistry.
Let’s deep-dive.
LLM performance on GPQA
Last updated: July 2024
OpenAI's GPT-4o leads with 53.6%, followed closely by Meta's Llama 3 400B at 48.0% and Google's Gemini 1.5 Pro at 46.2%.
The top models are tightly clustered, yet all remain well below expert-level accuracy; OpenAI's earlier GPT-4, at 35.7%, shows how much room for improvement is left.
What is the GPQA benchmark?
GPQA stands for "graduate-level Google-proof Q&A." It was introduced by Rein et al. (2023).
The benchmark includes 448 multiple-choice questions across biology, physics, and chemistry.
It is extremely difficult: domain experts reach about 65% accuracy, while non-experts with unrestricted internet access average only 34%.
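GPQA is scored as plain accuracy: the fraction of multiple-choice questions a model answers correctly. Below is a minimal Python sketch of that scoring, using assumed, illustrative field names (`GPQAItem`, `correct_index`); it is not the official dataset schema or evaluation harness.

```python
# Minimal sketch of GPQA-style scoring: each item is a four-option
# multiple-choice question, and the reported metric is plain accuracy.
# Field names here are illustrative, not the official dataset schema.
from dataclasses import dataclass

@dataclass
class GPQAItem:
    question: str
    choices: list[str]   # four answer options
    correct_index: int   # index of the correct option

def accuracy(items: list[GPQAItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted choice matches the answer key."""
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.correct_index)
    return correct / len(items)

# Toy usage with two illustrative items; the model gets one right.
items = [
    GPQAItem("Which particle mediates the electromagnetic force?",
             ["Gluon", "Photon", "W boson", "Higgs boson"], 1),
    GPQAItem("Which molecule carries amino acids to the ribosome?",
             ["mRNA", "tRNA", "rRNA", "DNA"], 1),
]
predictions = [1, 0]  # the model's chosen option index per question
print(f"Accuracy: {accuracy(items, predictions):.1%}")  # -> Accuracy: 50.0%
```

A score like GPT-4o's 53.6% therefore means the model picked the correct option on roughly 240 of the 448 questions.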
Pros & cons of the GPQA benchmark
| Pros | Cons |
| --- | --- |
|  |  |
|  |  |