Jul 17 · 1 min read
HumanEval benchmark - testing LLMs on coding
The HumanEval benchmark is an important LLM evaluation tool. It tests LLMs on coding. Let’s deep-dive. LLM performance based on...
Jul 17 · 1 min read
MATH benchmark - testing LLMs on math problems
The MATH benchmark is an important LLM evaluation tool. It tests LLMs on math problems. Let’s deep-dive. LLM performance based on MATH...

Jul 17 · 1 min read
GPQA benchmark - testing LLMs on graduate-level questions
The GPQA benchmark is an important LLM evaluation tool. It assesses how well LLMs handle complex, domain-specific questions in biology,...
Jan 28 · 1 min read
MMLU benchmark - testing LLMs' multi-task capabilities
The MMLU benchmark is an important LLM evaluation tool. It tests LLMs' multi-task capabilities. Let’s deep-dive. LLM performance based on...
Jan 17 · 2 min read
MT Bench: Evaluating LLMs
With recent advances in LLMs, AI is now used for many tasks, from writing and chatting to coding. However, evaluating...
Jan 12 · 3 min read
Chatbot Arena: A Grassroots LLM Evaluation
Large language models, commonly known as LLMs, have become very important in our new AI world. But how can you tell which one is the...