GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

Abstract

With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often cite scores reported in other papers, obtained without consistent settings and prompts, which may inadvertently encourage cherry-picking favorable settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs, as well as OpenAI's legacy models, on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study of OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. The community is currently eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.
