How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

Abstract

In this work, we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca), and systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction-following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best-performing instruction-tuned model suite, finetuned on a combination of high-quality open resources. Our experiments show that different instruction-tuning datasets can uncover or enhance specific skills, while no single dataset (or combination) provides the best performance across all evaluations. Interestingly, we find that model and human preference-based evaluations fail to reflect differences in model capabilities exposed by benchmark-based evaluations, suggesting the need for the type of systematic evaluation performed in this work. Our evaluations show that the best model in any given evaluation reaches on average 83% of ChatGPT performance and 68% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap. We release our instruction-tuned models, including a fully finetuned 65B Tülu, along with our code, data, and evaluation framework at https://github.com/allenai/open-instruct to facilitate future research.
