
HumanEval benchmark - testing LLMs on coding

The HumanEval benchmark is an important LLM evaluation tool.


It tests LLMs on coding.


Let’s deep-dive.



LLM performance based on HumanEval


Last updated: July 2024

Company     Model             HumanEval
OpenAI      GPT-4o            90.2%
Anthropic   Claude 3 Opus     84.9%
Meta        Llama 3 400B      84.1%
Google      Gemini 1.5 Pro    84.1%
Meta        Llama 3 70B       81.7%
Anthropic   Claude 3 Haiku    75.9%
Google      Gemini 1.0 Ultra  74.4%

OpenAI’s GPT-4o tops the list with a 90.2% score, underscoring its strong lead in AI coding ability. Anthropic’s Claude 3 Opus and Meta’s Llama 3 400B also rank high, reflecting fierce competition in this space.



What is the HumanEval benchmark?

HumanEval tests how well LLMs generate correct code from docstrings. It was introduced by Chen et al. (2021).


The benchmark includes 164 hand-written Python programming problems. Each problem provides a function signature and docstring, and is paired with a reference solution and unit tests.
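To make this concrete, here is a sketch of what a HumanEval-style problem looks like: the model receives the signature and docstring as a prompt, generates the function body, and the result is checked against hidden unit tests. The function and test values below are illustrative, written in the style of the dataset rather than quoted from it.

from typing import List

# Prompt given to the model: a signature plus docstring (illustrative, HumanEval-style).
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each other
    than the given threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Candidate completion the model is expected to produce.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests in the style of the benchmark's check functions.
def check(candidate):
    assert candidate([1.0, 2.0, 3.0], 0.5) is False
    assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True

check(has_close_elements)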


The final score is the fraction of problems the model solves, where a problem counts as solved only if the generated code passes all of its unit tests (the pass@k metric from the original paper, usually reported as pass@1).
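Concretely, the original paper scores each problem with an unbiased pass@k estimator: sample n completions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples would pass. A minimal pure-Python sketch of that formula follows; the function name and example counts are mine, not taken from the official evaluation harness.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator from Chen et al. (2021):
    # probability that at least one of k samples drawn from n passes,
    # given that c of the n samples pass the unit tests.
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-sample draw must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (hypothetical counts): 200 samples per problem, pass@1 per problem;
# the reported benchmark score is the average over all 164 problems.
per_problem = [pass_at_k(200, c, 1) for c in (180, 65, 0)]
print(sum(per_problem) / len(per_problem))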


Pros & cons of the HumanEval benchmark

Pros:

  • Standard way to compare models

  • Covers various coding tasks

Cons:

  • Some tasks lack context, making them tough for models

  • Dataset might have errors



