How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
June 7, 2023
Authors: Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi
cs.AI
Abstract
In this work, we explore recent advances in instruction-tuning language models
on a range of open instruction-following datasets. Despite recent claims that
open models can be on par with state-of-the-art proprietary models, these
claims are often accompanied by limited evaluation, making it difficult to
compare models across the board and determine the utility of various resources.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters
in size, trained on 12 instruction datasets ranging from manually curated
(e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca), and
systematically evaluate them on their factual knowledge, reasoning,
multilinguality, coding, and open-ended instruction following abilities through
a collection of automatic, model-based, and human-based metrics. We further
introduce T\"ulu, our best performing instruction-tuned model suite finetuned
on a combination of high-quality open resources.
Our experiments show that different instruction-tuning datasets can uncover
or enhance specific skills, while no single dataset (or combination) provides
the best performance across all evaluations. Interestingly, we find that model
and human preference-based evaluations fail to reflect differences in model
capabilities exposed by benchmark-based evaluations, suggesting the need for
the type of systematic evaluation performed in this work. Our evaluations show
that the best model in any given evaluation reaches on average 83% of ChatGPT
performance, and 68% of GPT-4 performance, suggesting that further investment
in building better base models and instruction-tuning data is required to close
the gap. We release our instruction-tuned models, including a fully finetuned
65B T\"ulu, along with our code, data, and evaluation framework at
https://github.com/allenai/open-instruct to facilitate future research.