Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

July 12, 2023
作者: Yihan Cao, Yanbin Kang, Lichao Sun
cs.AI

Abstract

Large language models typically undergo two training stages: pretraining and finetuning. Although large-scale pretraining endows the model with strong capabilities to generate natural language responses, these pretrained models can still fail to understand human instructions. To enhance language models' ability to interpret and respond to instructions, instruction finetuning has emerged as a critical method in this area. Recent studies have found that large language models can be finetuned to perform well even with a small amount of high-quality instruction-following data. However, the selection of high-quality datasets for finetuning language models still lacks clear guidelines. In this paper, we propose InstructMining, a linear rule for evaluating instruction-following data quality. We formulate InstructMining using specific natural language indicators. To investigate the relationship between data quality and these indicators, we conduct extensive finetuning experiments, and the experimental results are then used to estimate the parameters in InstructMining. To further evaluate its performance, we use InstructMining to select high-quality data from unseen datasets. The results demonstrate that InstructMining can select relatively high-quality samples from various instruction-following datasets. Compared to models finetuned on unfiltered datasets, models finetuned on InstructMining-selected datasets perform better in 42.5% of cases.
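The abstract describes the core idea: score each example with a linear combination of natural-language indicators, fit the combination's weights from finetuning experiment results, then keep the top-scoring samples. The sketch below illustrates that pipeline in minimal form; the indicator set, weight values, and the least-squares fit against evaluation loss are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def fit_linear_rule(indicators: np.ndarray, eval_loss: np.ndarray) -> np.ndarray:
    """Estimate linear-rule parameters by least squares.

    Each row of `indicators` holds hypothetical per-example indicator
    values (e.g. a reward-model score, response length, perplexity);
    `eval_loss` is the evaluation loss observed when finetuning on that
    data. Lower predicted loss is taken to mean higher data quality.
    """
    # Append a bias column so the rule has an intercept term.
    X = np.hstack([indicators, np.ones((indicators.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X, eval_loss, rcond=None)
    return weights

def select_top_quality(indicators: np.ndarray, weights: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with the lowest predicted loss,
    i.e. the highest estimated quality under the fitted rule."""
    X = np.hstack([indicators, np.ones((indicators.shape[0], 1))])
    predicted_loss = X @ weights
    return np.argsort(predicted_loss)[:k]

if __name__ == "__main__":
    # Toy demonstration: 100 examples, 3 synthetic indicators.
    rng = np.random.default_rng(0)
    ind = rng.normal(size=(100, 3))
    true_w = np.array([-0.8, 0.1, 0.5])  # synthetic ground-truth weights
    loss = ind @ true_w + 1.0 + rng.normal(scale=0.05, size=100)

    w = fit_linear_rule(ind, loss)
    top = select_top_quality(ind, w, k=20)
    print(f"fitted weights: {w[:3].round(2)}, selected {len(top)} samples")
```

In this toy setup the fitted weights recover the synthetic ground truth closely, and the 20 selected indices form the filtered subset that would be passed to finetuning.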