Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Abstract

Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from a larger pool, such that fine-tuning on this subset achieves results comparable to, or even exceeding, those obtained with the entire dataset. However, most existing data selection techniques are designed for small-scale data pools and fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods, that is, methods that do not rely on external model assistance, on two-million-scale datasets, and found that nearly all of them struggled to significantly outperform random selection on such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high-quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient way to improve results. This approach, particularly when training on long text data, proves highly beneficial for relatively weaker base models, such as Llama3.
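
As a rough illustration of the two baselines the abstract highlights, random selection and token-length filtering, the following minimal sketch selects a fixed-size subset from a toy pool. It is not the authors' implementation: the function names, the whitespace tokenizer, and the choice to keep the longest samples are assumptions made only for illustration. A real setup would presumably count tokens with the base model's own tokenizer.

```python
import random


def count_tokens(text):
    # Assumption: whitespace splitting stands in for a real model tokenizer.
    return len(text.split())


def select_random(pool, k, seed=0):
    # Uniform random selection of k samples without replacement.
    rng = random.Random(seed)
    return rng.sample(pool, k)


def select_by_token_length(pool, k):
    # Assumption: keep the k samples with the largest token counts.
    return sorted(pool, key=count_tokens, reverse=True)[:k]


# Hypothetical toy pool; a real pool would hold millions of SFT samples.
pool = [
    "short reply",
    "a somewhat longer instruction and response pair",
    "tiny",
    "this sample is the longest one in the toy pool by token count",
]

random_subset = select_random(pool, 2)
long_subset = select_by_token_length(pool, 2)
```

Both selectors need only one pass over the pool (plus a sort for the length-based one) and no scoring model.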
