
MT Bench: Evaluating LLMs

MT Benchmarking

With recent advances in LLMs, AI is now used for many tasks, from writing and chatting to coding.


However, evaluating their broad capabilities has also become more challenging.


In this article, we dive into one method for evaluating LLMs, called MT Bench.


This benchmark allows us to better understand how LLMs perform in multi-turn dialogues. 



Performance Insights from MT Bench Rankings

MT Bench leaderboard: 54 models, 213,576 votes, last updated Jan 9, 2024.

Based on the MT Bench evaluation data, the ranking showcases a diverse array of models, each with its own strengths:

  • OpenAI's GPT-4 Models Dominate: Holding the top three spots, these models demonstrate exceptional proficiency at managing complex, multi-turn dialogue.

  • Mistral AI's Impressive Standing: Mistral's fourth-place finish is significant for an open-source contender, indicating the growing competence of accessible AI technology in sophisticated language tasks.

  • Diversity in LLM Capabilities: The varied rankings highlight the range of skills different models possess, from nuanced conversation handling to specific task proficiency.

  • Evolving AI Landscape: The rankings reflect the dynamic nature of AI development, with both proprietary and open-source models making notable strides.


This ranking provides valuable insights into the current state and performance of different LLMs in handling nuanced and complex conversational tasks.



What is MT Bench?

MT Bench explained

MT Bench, short for "multi-turn benchmark," is a method for evaluating large language models (LLMs).


This benchmark offers a detailed analysis of LLM performance, focusing in particular on a model's ability to manage conversation flow and follow instructions in multi-turn dialogues. 

Since we interact with AI tools mostly through dialogue, this ability is very important. 


It is structured around 80 multi-turn questions, each designed to probe how well a model can sustain a conversation. The questions span eight categories (writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, and humanities/social-science knowledge), ensuring a holistic assessment.


Each category contains 10 unique, challenging questions, tailor-made to discern the nuanced capabilities of different models. This setup gives MT Bench a comprehensive lens through which the conversational competence of LLMs can be scrutinized and understood.
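To make that structure concrete, here is a minimal sketch of what one such multi-turn record and a two-turn evaluation loop could look like. The field names, the sample wording, and the model_chat helper are illustrative assumptions, not the benchmark's official schema or harness.

```python
# A minimal, illustrative sketch of an MT-Bench-style multi-turn record.
# Field names and the helper below are assumptions for illustration,
# not the benchmark's official data format.
sample_question = {
    "question_id": 81,        # hypothetical ID
    "category": "writing",    # one of the eight categories
    "turns": [
        "Compose an engaging travel blog post about a recent trip to Hawaii.",
        "Rewrite your previous response. Start every sentence with the letter A.",
    ],
}

def run_conversation(model_chat, question):
    """Feed the turns to the model one at a time, keeping the chat history
    so the second turn builds on the model's first answer."""
    history = []
    for turn in question["turns"]:
        history.append({"role": "user", "content": turn})
        reply = model_chat(history)  # model under test returns a string
        history.append({"role": "assistant", "content": reply})
    return history
```

The second turn deliberately depends on the first answer, which is exactly what makes the benchmark a test of multi-turn ability rather than single-shot question answering.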


For more details, you can read the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” by Lianmin Zheng et al.



Understanding MT-Bench Scoring

Calculation of the MT Bench score

The MT Bench score is based on a model's performance across the 80 multi-turn questions. These questions are designed to test the model's ability to maintain coherent and contextually appropriate dialogue over multiple turns.


Each response is graded by a strong LLM judge (GPT-4 in the original paper) for its adherence to the given instructions and its ability to maintain context and clarity over multiple conversational turns. 
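The sketch below shows what this single-answer grading step could look like in code. The prompt wording and the call_judge_model helper are hypothetical placeholders, not the reference implementation that accompanies the paper.

```python
import re

# Hypothetical judge prompt; the official MT-Bench judge prompts are worded differently.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the AI assistant's answer to the user's "
    "question on a scale of 1 to 10, considering helpfulness, relevance, "
    "accuracy, and level of detail.\n\n"
    "Question: {question}\n\nAnswer: {answer}\n\n"
    'End your reply with "Rating: [[X]]".'
)

def judge_answer(call_judge_model, question: str, answer: str) -> float:
    """Ask the judge model for a grade and parse the [[X]] rating out of its verdict.
    `call_judge_model` is a placeholder for whatever judge API you use."""
    verdict = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    if match is None:
        raise ValueError(f"Could not parse a rating from: {verdict!r}")
    return float(match.group(1))
```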


The final score is the average of these grades; figures like 9.3 for GPT-4-Turbo and 8.6 for Mistral-Medium summarize a model's overall competency in handling multi-turn conversations. 


These scores reflect not only the models' ability to generate coherent responses, but also their finesse in navigating the complexities of extended dialogue, offering a holistic view of their conversational capabilities.
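The headline number is then simply the mean of the per-turn grades across all questions. A minimal sketch, with hypothetical grades:

```python
from statistics import mean

def mt_bench_score(grades: list[float]) -> float:
    """Overall score: the mean of all per-turn judge grades for one model."""
    return round(mean(grades), 2)

# Hypothetical grades for one model across a handful of turns.
example_grades = [9.0, 10.0, 8.5, 9.5, 9.0]
print(mt_bench_score(example_grades))  # -> 9.2
```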



Conclusion: MT Bench for LLM Evaluation


In conclusion, MT Bench stands as a critical tool in a fast-changing AI landscape.


It offers a comprehensive measure of LLM capability, crucial for developers and users alike as AI becomes more integrated into our daily interactions and business workflows.
