HumanEval benchmark - testing LLMs on coding
MATH benchmark - testing LLMs on math problems
GPQA benchmark - testing LLMs on graduate-level questions
MMLU benchmark - testing LLMs' multi-task capabilities
MT Bench: Evaluating LLMs
Chatbot Arena: A Grassroots LLM Evaluation