ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
October 23, 2024
Authors: Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo
cs.AI
Abstract
Data selection is crucial for optimizing language model (LM) performance on
specific tasks, yet most existing methods fail to effectively consider the
target task distribution.
Current approaches either ignore task-specific requirements entirely or rely
on approximations that fail to capture the nuanced patterns needed for tasks
like Autoformalization or code generation.
Methods that do consider the target distribution often rely on simplistic
representations, such as hashed n-gram features, which are prone to hash
collisions and introduce noise.
We introduce ZIP-FIT, a data selection framework that uses gzip compression
to directly measure alignment between potential training data and the target
task distribution.
In extensive evaluations on Autoformalization and Python code generation,
ZIP-FIT significantly outperforms leading baselines like DSIR and D4.
Models trained on ZIP-FIT-selected data achieve their lowest cross-entropy
loss up to 85.1% faster than baselines, demonstrating that better task
alignment leads to more efficient learning.
In addition, ZIP-FIT performs selection up to 65.8% faster than DSIR and two
orders of magnitude faster than D4.
Notably, ZIP-FIT shows that smaller, well-aligned datasets often outperform
larger but less targeted ones, demonstrating that a small amount of
higher-quality data is superior to a large amount of lower-quality data.
Our results imply that task-aware data selection is crucial for efficient
domain adaptation, and that compression offers a principled way to measure task
alignment.
By showing that targeted data selection can dramatically improve
task-specific performance, our work provides new insights into the relationship
between data quality, task alignment, and model learning efficiency.
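For intuition, gzip-based alignment can be sketched with the Python standard library alone. The snippet below is a minimal illustration, not the paper's implementation: it assumes an NCD-style (normalized compression distance) score and a simple top-k selection rule; the helper names (gzip_size, ncd, select_top_k) and the averaging of scores over target samples are illustrative choices, not details taken from ZIP-FIT.

```python
import gzip

def gzip_size(data: bytes) -> int:
    # Length of the gzip-compressed byte string, a proxy for Kolmogorov complexity.
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance:
    #   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    # where C(.) is compressed length. Lower values mean more shared
    # structure, i.e., better alignment between the two texts.
    cx = gzip_size(x.encode("utf-8"))
    cy = gzip_size(y.encode("utf-8"))
    cxy = gzip_size((x + y).encode("utf-8"))
    return (cxy - min(cx, cy)) / max(cx, cy)

def select_top_k(candidates: list[str], target_samples: list[str], k: int) -> list[str]:
    # Hypothetical selection rule: score each candidate by its average
    # alignment (1 - NCD) to a few target-task samples, keep the top k.
    def alignment(c: str) -> float:
        return sum(1.0 - ncd(c, t) for t in target_samples) / len(target_samples)
    return sorted(candidates, key=alignment, reverse=True)[:k]
```

Because the only primitive here is a general-purpose compressor, this style of scoring requires no embeddings or GPU inference, which is consistent with the embedding-free framing and the selection-speed comparisons against DSIR and D4 reported in the abstract.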