GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

Abstract

With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study of OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. The community is eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.
