
HumanEval benchmark: Testing LLMs on coding

Updated: Dec 29, 2024

The HumanEval benchmark is an important LLM evaluation tool.


It tests how well LLMs generate accurate code from docstrings, making it a key measure of coding proficiency.


Let’s dive in.



Best LLM for the HumanEval benchmark

The table below compares the main frontier models on the HumanEval benchmark leaderboard.

HumanEval benchmark across frontier LLMs

Last updated: December 2024

Company | Model | Score

OpenAI | GPT-4o | 90.2%

Meta | Llama 3.1 405B | 89.0%

xAI | Grok-2 | 88.4%

Anthropic | Claude 3 Opus | 84.9%

Google | Gemini 2.0 | Unknown

OpenAI’s GPT-4o leads with an impressive 90.2%, showcasing its coding strength.

Meta’s Llama 3.1 405B (89.0%) and xAI’s Grok-2 (88.4%) are close competitors, reflecting the intense race to develop the best LLM for HumanEval.



What is the HumanEval benchmark?

HumanEval tests how well LLMs can generate correct code based on docstrings.


It was introduced by Chen et al. (2021) as a way to evaluate a model’s coding ability on hand-written programming problems.


The test includes 164 coding problems, each consisting of (see the sketch after this list):

  • Function signatures

  • Docstrings

  • Code bodies

  • Unit tests
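
To make that structure concrete, here is a minimal sketch of what a HumanEval-style task looks like. The function name, docstring, solution, and tests below are illustrative (loosely modeled on the kind of problem the dataset contains), not copied verbatim from it.

# Prompt given to the model: a function signature plus a docstring.
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # The model must generate the function body from here.


# Reference "code body" (canonical solution) for the same task.
def has_close_elements_solution(numbers, threshold):
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


# Unit tests: a completion counts as correct only if every assertion passes.
def check(candidate):
    assert candidate([1.0, 2.0, 3.0], 0.5) is False
    assert candidate([1.0, 2.8, 3.0], 0.3) is True

check(has_close_elements_solution)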


The final HumanEval score is the fraction of the 164 problems for which the model’s generated code passes all unit tests, most often reported as pass@1.
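
Per-problem accuracy is usually estimated with the pass@k metric from Chen et al. (2021): generate n samples per problem, count how many (c) pass the unit tests, and compute the probability that at least one of k randomly drawn samples passes. Below is a minimal sketch of that estimator in Python; the variable names and the n = 20 sampling budget are illustrative choices, not fixed by the benchmark.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator from Chen et al. (2021):
    # 1 - C(n - c, k) / C(n, k), i.e. the chance that at least one of
    # k samples drawn from the n generated ones passes the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem pass counts, n = 20 samples each (one entry per problem).
pass_counts = [20, 17, 0, 5]

# The reported HumanEval score is the mean pass@1 across all problems.
score = sum(pass_at_k(n=20, c=c, k=1) for c in pass_counts) / len(pass_counts)
print(f"pass@1 = {score:.1%}")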



Other LLM benchmarks

At BRACAI, we keep track of how the main frontier models perform across multiple benchmarks.


If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.
