GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
September 28, 2023
Authors: Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang
cs.AI
Abstract
With the rapid advancement of large language models (LLMs), there is a
pressing need for a comprehensive evaluation suite to assess their capabilities
and limitations. Existing LLM leaderboards often reference scores reported in
other papers without consistent settings and prompts, which may inadvertently
encourage cherry-picking favored settings and prompts for better results. In
this work, we introduce GPT-Fathom, an open-source and reproducible LLM
evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+
leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across
7 capability categories, all under aligned settings. Our retrospective study on
OpenAI's earlier models offers valuable insights into the evolutionary path
from GPT-3 to GPT-4. Currently, the community is eager to understand how GPT-3
progressively improved into GPT-4, including technical details such as whether
adding code data improves an LLM's reasoning capability, which aspects of LLM
capability can be improved by SFT and RLHF, and how large the alignment tax is.
Our analysis sheds light on many of these questions, aiming to improve the
transparency of advanced LLMs.