Curious to learn how DeepSeek performs?
We look at how this Chinese AI startup's models stack up against leading AI models on critical benchmarks.
Let’s dive in.
How does DeepSeek perform vs frontier AI models?
Last updated: January 2025
| LLM | Company | MMLU | MATH | GPQA | HumanEval | Chatbot Arena |
|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 88.7% | 76.6% | 53.6% | 90.2% | 98.8% |
| Claude3 Opus | Anthropic | 86.8% | 60.1% | 50.4% | 84.9% | 92.8% |
| Gemini 2.0 | Google | 76.4% | 89.7% | 62.1% | N/A | 100.0% |
| Llama 3.1 405B | Meta | 88.6% | 73.8% | 51.1% | 89.0% | 91.8% |
| Grok-2 | xAI | 87.5% | 76.1% | 56.0% | 88.4% | 93.2% |
| DeepSeek-V3 | DeepSeek | 88.5% | 90.2% | 59.1% | 82.6% | 95.3% |
Source: DeepSeek-V3 Technical Report. Additional sources available here.
DeepSeek-V3 leads on the MATH benchmark with an impressive 90.2%, the highest score in the field. It also holds strong on GPQA and ranks competitively in the Chatbot Arena, where conversational capabilities are put to the test. While it lags slightly on HumanEval, the metrics indicate it is closing the gap with more established models like OpenAI's GPT-4o and Anthropic's Claude3 Opus.
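To see where each model leads, you can put the scores from the table above into a small script and pick out the top performer per benchmark. This is a minimal sketch: the `SCORES` dictionary is simply a transcription of the table (N/A entries omitted), and `leader` is an illustrative helper, not part of any benchmark suite.

```python
# Benchmark scores transcribed from the comparison table above (N/A omitted).
SCORES = {
    "GPT-4o":         {"MMLU": 88.7, "MATH": 76.6, "GPQA": 53.6, "HumanEval": 90.2},
    "Claude3 Opus":   {"MMLU": 86.8, "MATH": 60.1, "GPQA": 50.4, "HumanEval": 84.9},
    "Gemini 2.0":     {"MMLU": 76.4, "MATH": 89.7, "GPQA": 62.1},
    "Llama 3.1 405B": {"MMLU": 88.6, "MATH": 73.8, "GPQA": 51.1, "HumanEval": 89.0},
    "Grok-2":         {"MMLU": 87.5, "MATH": 76.1, "GPQA": 56.0, "HumanEval": 88.4},
    "DeepSeek-V3":    {"MMLU": 88.5, "MATH": 90.2, "GPQA": 59.1, "HumanEval": 82.6},
}

def leader(benchmark):
    """Return (model, score) for the highest score on the given benchmark."""
    entries = {m: s[benchmark] for m, s in SCORES.items() if benchmark in s}
    top = max(entries, key=entries.get)
    return top, entries[top]

print(leader("MATH"))   # DeepSeek-V3 tops MATH
print(leader("MMLU"))   # GPT-4o edges out the field on MMLU
```

Running `leader` across the four benchmarks shows no single model dominates, which is the pattern the table illustrates: DeepSeek-V3 wins MATH, while GPT-4o leads MMLU and HumanEval.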
This strong showing, despite being developed with fewer resources, highlights DeepSeek’s strategic focus on optimization—leveraging fewer chips while maintaining competitive accuracy.
What is DeepSeek?
DeepSeek is a small Chinese AI lab founded by hedge fund manager Liang Wenfeng. In January 2025, the company revealed how to build a cost-effective LLM that learns and improves without human supervision.
DeepSeek claims its new model, DeepSeek V3, was developed using far fewer Nvidia chips than US competitors. This raises questions about the future of AI hardware investments and Silicon Valley’s reliance on expensive infrastructure.
Some researchers speculate DeepSeek reduced training costs by leveraging OpenAI’s latest models. While this approach has helped it quickly replicate US advancements, it could limit its ability to surpass them in the future.
Implications for the AI landscape
DeepSeek’s cost-efficient approach could signal a shift in the AI race. By achieving competitive performance with fewer resources, the startup challenges the assumption that massive capital investment is necessary to stay at the forefront of AI development.
For competitors like OpenAI and Google, this raises a question: Will reliance on expensive infrastructure continue to scale, or is there a ceiling?
Additionally, DeepSeek’s breakthroughs are a reminder of the growing geopolitical competition in AI, particularly as China continues to foster innovation in this space. The model’s advancements in benchmarks like MATH also hint at a focus on technical and scientific tasks, potentially aligning with national objectives in tech innovation.
Conclusion
DeepSeek V3 may not yet lead across all metrics, but its combination of cost efficiency and performance positions it as a formidable player. Its ability to innovate on a lean budget is a potential game-changer in the global AI landscape.
If you have any questions about these benchmarks, or how to get started with AI in your business, feel free to reach out.