GPT, Claude, Llama? How to tell which AI model is best


2024-07-31

On MMLU, for example, the most powerful version of Llama 3.1 scored 88.6%, against 88.7% for GPT-4o and 88.3% for Claude 3.5 Sonnet, rival models made by OpenAI and Anthropic, two AI startups, respectively. Claude 3.5 Sonnet had itself been unveiled on June 20th, again with a table of impressive benchmark scores. And on July 24th, the day after Llama 3.1’s debut, Mistral, a French AI startup, announced Mistral Large 2, its latest LLM, with—you’ve guessed it—yet another table of benchmarks. Where do such numbers come from, and can they be trusted?
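At bottom, an MMLU-style headline figure is just accuracy on a large set of multiple-choice questions: a score of 88.6% means the model picked the right option on that share of questions. The sketch below is purely illustrative (the function name and the toy data are hypothetical, not the benchmark's official scoring harness), but it shows how such a percentage is tallied.

```python
# Illustrative sketch only: MMLU-style scores are accuracy on
# multiple-choice questions, i.e. the share answered correctly.

def mmlu_style_accuracy(model_answers, correct_answers):
    """Return the percentage of questions the model got right.

    model_answers / correct_answers: lists of choice letters, e.g. "A".."D".
    """
    assert len(model_answers) == len(correct_answers)
    hits = sum(m == c for m, c in zip(model_answers, correct_answers))
    return 100.0 * hits / len(correct_answers)

# Hypothetical example: 3 of 4 questions answered correctly -> 75.0
print(mmlu_style_accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```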
