

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

October 12, 2024
Authors: Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
cs.AI

Abstract

Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from a larger pool, such that fine-tuning on this subset achieves results comparable to, or even exceeding, those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools and fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods (those that do not rely on external model assistance) on two-million-scale datasets, and found that nearly all of them struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high-quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient way to improve results. This approach, particularly when training on long-text data, proves highly beneficial for relatively weaker base models, such as Llama3.
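
The sketch below illustrates the two selection strategies contrasted in the abstract: uniform random sampling of a fixed-size subset and a simple token-length filter over the same pool. It is a minimal sketch, not the paper's implementation; the `random_selection`, `token_length`, and `length_filtered_selection` helpers, the whitespace-based token count, and the `budget` parameter are assumptions made for illustration (a real pipeline would use the target model's tokenizer).

```python
import random

# Minimal sketch of two SFT data-selection strategies: a random baseline
# and a token-length filter. Assumptions (not from the paper): each example
# is a dict with "instruction" and "response" fields, and token length is
# approximated by whitespace splitting.

def random_selection(pool, budget, seed=42):
    """Baseline: uniformly sample `budget` examples from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, min(budget, len(pool)))

def token_length(example):
    """Crude token count; swap in the base model's tokenizer in practice."""
    return len((example["instruction"] + " " + example["response"]).split())

def length_filtered_selection(pool, budget):
    """Keep the longest examples, a simple stand-in for token-length filtering."""
    ranked = sorted(pool, key=token_length, reverse=True)
    return ranked[:budget]

if __name__ == "__main__":
    # Toy pool standing in for a multi-million-example SFT dataset.
    pool = [
        {"instruction": f"Question {i}", "response": "word " * (i % 500)}
        for i in range(10_000)
    ]
    subset_random = random_selection(pool, budget=1_000)
    subset_long = length_filtered_selection(pool, budget=1_000)
    print(len(subset_random), len(subset_long))
```

Per the abstract's finding, the random baseline is already hard to beat at the two-million-example scale, with token-length filtering serving as a cheap, stable refinement, especially for weaker base models trained on long-text data.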

