Rethinking Data Selection at Scale: Random Selection is Almost All You Need
October 12, 2024
Authors: Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
cs.AI
Abstract
Supervised fine-tuning (SFT) is crucial for aligning Large Language Models
(LLMs) with human instructions. The primary goal during SFT is to select a
small yet representative subset of training data from the larger pool, such
that fine-tuning with this subset achieves results comparable to or even
exceeding those obtained using the entire dataset. However, most existing data
selection techniques are designed for small-scale data pools, which fail to
meet the demands of real-world SFT scenarios. In this paper, we replicated
several self-scoring methods (those that do not rely on external model
assistance) on two-million-scale datasets, and found that nearly all methods
struggled to significantly outperform random selection when dealing with such
large-scale data pools. Moreover, our comparisons suggest that, during SFT,
diversity in data selection is more critical than simply focusing on
high-quality data. We also analyzed the limitations of several current approaches,
explaining why they perform poorly on large-scale datasets and why they are
unsuitable for such contexts. Finally, we found that filtering data by token
length offers a stable and efficient method for improving results. This
approach, particularly when training on long text data, proves highly
beneficial for relatively weaker base models, such as Llama3.
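The last finding, that filtering by token length is a stable and efficient selection strategy, is concrete enough to sketch. The snippet below is only an illustration under assumptions (a Hugging Face tokenizer, instruction/response fields, and keeping the longest examples as the filter), not the authors' implementation; it places the token-length filter next to the random-selection baseline the paper compares against.

```python
# Minimal sketch (assumed field names and tokenizer; not the paper's code):
# compare a random-selection baseline with token-length filtering for SFT data.
import random
from transformers import AutoTokenizer


def random_select(pool, k, seed=0):
    """Baseline: uniformly sample k examples from the data pool."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def token_length_select(pool, k, tokenizer):
    """Keep the k examples whose instruction + response has the most tokens."""
    def n_tokens(example):
        text = example["instruction"] + "\n" + example["response"]
        return len(tokenizer.encode(text))
    return sorted(pool, key=n_tokens, reverse=True)[:k]


if __name__ == "__main__":
    # Any tokenizer works for counting tokens; this model id is an assumption.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
    pool = [
        {"instruction": "Summarize the plot of Hamlet.",
         "response": "Hamlet is a prince of Denmark who..."},
        {"instruction": "2+2?", "response": "4"},
    ]
    print(token_length_select(pool, k=1, tokenizer=tokenizer))
    print(random_select(pool, k=1))
```

Both selectors return a subset of the pool to be used for SFT; the paper's comparison is whether heavier scoring pipelines beat the random baseline at the two-million-example scale, and length-based filtering is reported as the stable improvement.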