INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Abstract

Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite these impressive capabilities, their full potential is still not comprehensively understood, primarily because many models are black boxes and holistic evaluation studies are lacking. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous work, our evaluation rigorously assesses models on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyzing the factors that affect model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support the claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.
