Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Abstract

As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications, which naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capabilities; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable with GPT-4, while open-source models still lag behind; (3) LLaMA-65B performs close to code-davinci-002, indicating that with successful further development, such as reinforcement learning from human feedback (RLHF), it has great potential to approach GPT-3.5-Turbo. Our results also suggest that, for open-source efforts to catch up, the community may need to focus more on building better base models and exploring RLHF.
