Jul 17 · 1 min read
HumanEval benchmark - testing LLMs on coding
The HumanEval benchmark is an important LLM evaluation tool. It tests LLMs on coding. Let’s deep-dive. LLM performance based on...
Jul 17 · 1 min read
MATH benchmark - testing LLMs on math problems
The MATH benchmark is an important LLM evaluation tool. It tests LLMs on math problems. Let’s deep-dive. LLM performance based on MATH...

Jul 17 · 1 min read
GPQA benchmark - testing LLMs on graduate-level questions
The GPQA benchmark is an important LLM evaluation tool. It assesses how well LLMs handle complex, domain-specific questions in biology,...
Jan 28 · 1 min read
MMLU benchmark - testing LLMs' multi-task capabilities
The MMLU benchmark is an important LLM evaluation tool. It tests LLMs' multi-task capabilities. Let’s deep-dive. LLM performance based on...
Jan 17 · 2 min read
MT Bench: Evaluating LLMs
With recent advances in LLMs, AI is now used for many tasks, from writing and chatting to coding. However, evaluating...
Jan 12 · 3 min read
Chatbot Arena: A Grassroots LLM Evaluation
Large language models, commonly known as LLMs, have become very important in our new AI world. But how can you tell which one is the...