Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

Abstract

Large language models typically undergo two training stages: pretraining and finetuning. Although large-scale pretraining gives a model a strong ability to generate natural language responses, pretrained models can still fail to understand human instructions. To improve language models' ability to interpret and respond to instructions, instruction finetuning has emerged as a critical method in this area. Recent studies have found that large language models can be finetuned to perform well even with a small amount of high-quality instruction-following data. However, there are still no clear guidelines for selecting high-quality datasets for finetuning language models. In this paper, we propose InstructMining, a linear rule for evaluating the quality of instruction-following data. We formulate InstructMining using specific natural language indicators. To investigate the relationship between data quality and these indicators, we conduct extensive finetuning experiments. The experimental results are then used to estimate the parameters of InstructMining. To further investigate its performance, we use InstructMining to select high-quality data from unseen datasets. The results demonstrate that InstructMining can help select relatively high-quality samples from various instruction-following datasets. Compared to models finetuned on unfiltered datasets, models finetuned on InstructMining-selected datasets perform better in 42.5% of cases.
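
To make the idea of a linear quality rule concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each example is described by a vector of numeric indicators, that the rule is fit by ordinary least squares (the abstract does not name an estimator), and that a higher score means higher quality. The two indicators, the toy quality values, and every function name are hypothetical and serve only as illustration.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations touch both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_rule(indicators: Sequence[Sequence[float]],
                    quality: Sequence[float]) -> List[float]:
    """Least-squares fit; returns [intercept, w_1, ..., w_k]."""
    rows = [[1.0] + list(x) for x in indicators]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, quality)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def score(weights: Sequence[float], x: Sequence[float]) -> float:
    """Apply the linear rule to one indicator vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def select_top(weights: Sequence[float],
               candidates: Sequence[Tuple[float, ...]],
               n: int) -> List[Tuple[float, ...]]:
    """Keep the n candidates with the highest estimated quality."""
    return sorted(candidates, key=lambda x: score(weights, x), reverse=True)[:n]


# Toy data (hypothetical): quality = 1 + 2 * indicator_a - 1 * indicator_b.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
train_y = [1.0, 3.0, 0.0, 2.0, 4.0]
weights = fit_linear_rule(train_x, train_y)

# Rank unseen candidates with the fitted rule and keep the best two.
unseen = [(0.5, 0.5), (2.0, 0.0), (0.0, 2.0)]
chosen = select_top(weights, unseen, 2)
```

In this sketch the fitting step stands in for the parameter estimation described in the abstract, and `select_top` stands in for selecting data from an unseen dataset. In practice the quality values would come from finetuning experiments rather than a hand-written formula.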

