Instruction Mining: High-Quality Instruction Data Selection for Large Language Models
July 12, 2023
Authors: Yihan Cao, Yanbin Kang, Lichao Sun
cs.AI
Abstract
Large language models typically undergo two training stages, pretraining and
finetuning. Although large-scale pretraining endows the model with strong
capabilities to generate natural language responses, these pretrained models
can still fail to understand human instructions at times. To enhance language
models' ability to interpret and respond to instructions, instruction
finetuning has emerged as a critical method in this area. Recent studies found
that large language models can be finetuned to perform well even with a small
amount of high-quality instruction-following data. However, the selection of
high-quality datasets for finetuning language models still lacks clear
guidelines to follow. In this paper, we propose InstructMining, a linear rule
for evaluating instruction-following data quality. We formulate InstructMining
using specific natural language indicators. To investigate the relationship
between data quality and these indicators, we further conduct extensive
finetuning experiments. The experimental results are then used to estimate the
parameters in InstructMining. To further investigate its performance, we use
InstructMining to select high-quality data from unseen datasets. Results
demonstrate that InstructMining can help select relatively high-quality samples
from various instruction-following datasets. Compared to models finetuned on
unfiltered datasets, models finetuned on InstructMining-selected datasets
perform better in 42.5% of cases.
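The core idea above, a linear rule over natural language indicators that scores instruction-following samples and keeps the highest-quality ones, can be sketched as follows. This is a minimal illustrative sketch only: the indicator names, weights, and sample values are hypothetical assumptions, not the indicators or estimated parameters from the paper.

```python
# Hypothetical sketch of a linear data-quality rule in the spirit of
# InstructMining. Indicator names and weights below are illustrative
# assumptions, not the paper's actual indicators or fitted parameters.

def quality_score(indicators, weights, bias=0.0):
    """Linear rule: score = bias + sum_i w_i * x_i over shared keys."""
    return bias + sum(weights[name] * indicators[name] for name in weights)

# Made-up weights: higher reward is good, higher perplexity is bad.
weights = {"response_length": 0.1, "reward_score": 1.5, "perplexity": -0.8}

# Two made-up instruction-response samples, each described by its indicators.
samples = [
    {"response_length": 0.3, "reward_score": 0.9, "perplexity": 0.2},
    {"response_length": 0.5, "reward_score": 0.4, "perplexity": 0.7},
]

# Rank samples by score and keep the top fraction for finetuning.
scored = sorted(samples, key=lambda s: quality_score(s, weights), reverse=True)
top_half = scored[: max(1, len(scored) // 2)]
```

In practice the weights would be estimated by regressing a measured quality signal (e.g. finetuned-model evaluation loss) on the indicators across many finetuning runs, which is the role the paper's extensive experiments play.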