
Chatbot Arena: A Grassroots LLM Evaluation

Updated: Dec 12

LLM evals across OpenAI, Anthropic, Mistral, and Google

Large language models, commonly known as LLMs, have become a cornerstone of today's AI landscape.


But how can you tell which one is the best?


There are many methods. One great way is to use the Chatbot Arena by LMSYS Org, an open research organization founded by students and faculty from UC Berkeley.


In this post, we discuss the performance of LLMs in the Chatbot Arena and why it is a good approach for LLM evals.



Which LLM Model is Best?

Chatbot arena leaderboard ranking of LLM models
Total #models: 54. Total #votes: 213,576. Last updated: Jan 9, 2024.

As the LLM race intensifies, a key question emerges: Which LLM model stands out as the best? The latest rankings in the Chatbot Arena provide valuable insights into this.


OpenAI Lead: OpenAI models, particularly the GPT-4 series, continue to dominate as the most capable LLMs. GPT-4 Turbo, with the highest Elo rating on the board, holds the crown. However, it's not just one model outshining the others; it's a consistent trend of excellence, with GPT-4 variants securing the top three positions. This dominance speaks to OpenAI's strategic focus on iterative improvement and robust AI capabilities.


Mistral AI Rising Star: In an impressive feat, Mistral AI has climbed the ranks and is now challenging the top players. This surge is a testament to Mistral's technical capabilities, and it also shows how much open-weight model releases can shape the future of AI. Backed by prominent venture investors, Mistral AI's success story adds a new dynamic to the LLM landscape.


Anthropic Steady Presence: Anthropic, with its series of Claude models, remains a strong contender. While these models consistently rank among the top performers, there are signs of fluctuating performance. Issues such as higher refusal rates in newer versions like Claude 2.1 may be hurting user satisfaction, which calls for a more nuanced understanding of user preferences and model optimization.


Google Mid-Table Performance: Google's Gemini models, including Gemini Pro, sit in the middle of the table, with performance on par with models like GPT-3.5. This middle-tier ranking reflects solid progress from Google in the LLM domain, even if it has not yet reached the front of the pack.


The Open Source vs Proprietary Debate: A notable observation from the latest rankings is the difference between proprietary and open-source models. While proprietary models, primarily from big players like OpenAI, continue to lead, open-source models are not far behind. This trend highlights the growing influence of open-source contributions in AI and the potential for a more democratized AI landscape.



What Makes Chatbot Arena Unique?

 LMSYS Chatbot Arena Leaderboard

For anyone trying to determine the best LLM, the Chatbot Arena by LMSYS is an insightful platform. What sets it apart is its innovative approach to evaluating and comparing language models. Unlike traditional benchmarks that often rely on technical metrics difficult for the general audience to grasp, the Chatbot Arena adopts a user-centric method.


User-Driven Insights: At the heart of the Chatbot Arena's evaluation process is the user experience. Users chat with two anonymous models side by side and vote for the better response; because model identities are hidden, the assessment is unbiased by brand perception. This approach lets users interact with the models without preconceived notions, ensuring that the evaluations are based solely on performance and user satisfaction.


Elo Rating System: Borrowing from the world of chess, the Chatbot Arena utilizes the Elo rating system to rank the LLMs. This dynamic system adjusts the ratings based on user interactions and feedback, providing a live, evolving leaderboard of model competencies. This method offers a more intuitive and accessible way for users to understand and compare the performance of different models.
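To make the rating mechanics concrete, here is a minimal Python sketch of Elo-style updates over anonymized head-to-head votes, in the spirit of the Arena's approach. The K-factor, base rating, model names, and battle log below are illustrative assumptions, not LMSYS's actual parameters or data.

```python
from collections import defaultdict

K = 32       # illustrative K-factor (assumption, not LMSYS's exact constant)
BASE = 1000  # illustrative starting rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, winner):
    """Update two models' ratings after one anonymized head-to-head vote.

    winner is "a", "b", or "tie".
    """
    r_a, r_b = ratings[model_a], ratings[model_b]
    e_a = expected_score(r_a, r_b)
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = r_a + K * (s_a - e_a)
    ratings[model_b] = r_b + K * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical battle log: (model_a, model_b, user vote)
battles = [
    ("gpt-4-turbo", "claude-2.1", "a"),
    ("mixtral-8x7b", "gpt-3.5-turbo", "a"),
    ("gemini-pro", "gpt-3.5-turbo", "tie"),
]

ratings = defaultdict(lambda: float(BASE))
for model_a, model_b, vote in battles:
    update_elo(ratings, model_a, model_b, vote)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model:15s} {rating:7.1f}")
```

Each vote nudges the winner's rating up and the loser's down by an amount proportional to how surprising the result was, which is what lets the leaderboard evolve live as new votes come in.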


Transparency and Accessibility: By making the latest model ratings and statistics publicly accessible, the Chatbot Arena fosters a transparent and inclusive environment. Enthusiasts, researchers, and casual users alike can delve into the rankings at Chatbot Arena Ratings to see the current standings, making the complex world of LLMs more approachable and understandable.


A More Comprehensive Evaluation: The Chatbot Arena's approach goes beyond traditional benchmarks by considering a holistic view of LLM performance. It's not just about how technically proficient a model is but also about how well it resonates with actual users. This user-focused benchmarking offers a more rounded and practical evaluation of LLMs, reflecting real-world usability.



Conclusion: Analyzing the LLM Landscape with the Chatbot Arena


With the rapid evolution of LLMs, the Chatbot Arena is a great tool for understanding and comparing them. This grassroots evaluation distills the complex world of LLMs into valuable, digestible information.


It will be very interesting to continue to follow this space going forward.


