How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
June 7, 2023
Authors: Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi
cs.AI
Abstract
In this work we explore recent advances in instruction-tuning language models
on a range of open instruction-following datasets. Despite recent claims that
open models can be on par with state-of-the-art proprietary models, these
claims are often accompanied by limited evaluation, making it difficult to
compare models across the board and determine the utility of various resources.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters
in size, trained on 12 instruction datasets ranging from manually curated
(e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca), and
systematically evaluate them on their factual knowledge, reasoning,
multilinguality, coding, and open-ended instruction-following abilities through
a collection of automatic, model-based, and human-based metrics. We further
introduce Tülu, our best-performing instruction-tuned model suite finetuned
on a combination of high-quality open resources.
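
For readers unfamiliar with the recipe, the sketch below illustrates the core of the instruction tuning described above: instruction-response pairs from open datasets are rendered into a single chat-style template and a causal language model is finetuned on them, with the loss computed only on the response tokens. The template, the placeholder base model, and the toy example are assumptions chosen for illustration, not the exact configuration used for Tülu.

```python
# Minimal sketch of instruction tuning a causal LM on instruction-response
# pairs. The chat template, base model, and example below are illustrative
# assumptions, not the exact recipe used in this paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "facebook/opt-125m"  # tiny placeholder; the paper finetunes much larger base models
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def build_example(instruction: str, response: str) -> dict:
    """Render one pair with a simple chat template and mask the prompt
    tokens so the loss is computed only on the response."""
    prompt = f"<|user|>\n{instruction}\n<|assistant|>\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + response_ids,
        # -100 tells the cross-entropy loss to ignore the prompt positions.
        "labels": [-100] * len(prompt_ids) + response_ids,
    }

# One toy training step; a real run would iterate over the mixed datasets.
example = build_example("List three primary colors.", "Red, yellow, and blue.")
input_ids = torch.tensor([example["input_ids"]])
labels = torch.tensor([example["labels"]])
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # followed by optimizer.step() in an actual training loop
```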
Our experiments show that different instruction-tuning datasets can uncover
or enhance specific skills, while no single dataset (or combination) provides
the best performance across all evaluations. Interestingly, we find that model
and human preference-based evaluations fail to reflect differences in model
capabilities exposed by benchmark-based evaluations, suggesting the need for
the type of systematic evaluation performed in this work. Our evaluations show
that the best model in any given evaluation reaches on average 83% of ChatGPT
performance, and 68% of GPT-4 performance, suggesting that further investment
in building better base models and instruction-tuning data is required to close
the gap. We release our instruction-tuned models, including a fully finetuned
65B Tülu, along with our code, data, and evaluation framework at
https://github.com/allenai/open-instruct to facilitate future research.
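
To make the aggregate numbers above concrete, the snippet below shows one plausible way to compute a "percentage of ChatGPT performance" summary: for each evaluation, the best open model's score is divided by ChatGPT's score, and the ratios are averaged. The benchmark names are examples of tasks in the paper's evaluation suite, but the scores are made-up placeholders, not reported results.

```python
# Hypothetical illustration of averaging per-benchmark performance relative
# to ChatGPT. All numbers below are placeholders, not the paper's results.
best_open_model = {"MMLU": 55.0, "GSM8K": 35.0, "Codex-Eval": 30.0}
chatgpt         = {"MMLU": 67.0, "GSM8K": 55.0, "Codex-Eval": 45.0}

ratios = [best_open_model[b] / chatgpt[b] for b in best_open_model]
relative = 100 * sum(ratios) / len(ratios)
print(f"Average relative performance: {relative:.1f}% of ChatGPT")
```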