ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
October 23, 2024
Authors: Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo
cs.AI
Abstract
Data selection is crucial for optimizing language model (LM) performance on
specific tasks, yet most existing methods fail to effectively consider the
target task distribution.
Current approaches either ignore task-specific requirements entirely or rely
on approximations that fail to capture the nuanced patterns needed for tasks
like Autoformalization or code generation.
Methods that do consider the target distribution often rely on simplistic
representations, such as hashed n-gram features, which are prone to hash
collisions and introduce noise.
We introduce ZIP-FIT, a data selection framework that uses gzip compression
to directly measure alignment between potential training data and the target
task distribution.
In extensive evaluations on Autoformalization and Python code generation,
ZIP-FIT significantly outperforms leading baselines like DSIR and D4.
Models trained on ZIP-FIT-selected data achieve their lowest cross-entropy
loss up to 85.1% faster than baselines, demonstrating that better task
alignment leads to more efficient learning.
In addition, ZIP-FIT performs selection up to 65.8% faster than DSIR and two
orders of magnitude faster than D4.
Notably, ZIP-FIT shows that smaller, well-aligned datasets often outperform
larger but less targeted ones, demonstrating that a small amount of
higher-quality data is superior to a large amount of lower-quality data.
Our results imply that task-aware data selection is crucial for efficient
domain adaptation, and that compression offers a principled way to measure task
alignment.
By showing that targeted data selection can dramatically improve
task-specific performance, our work provides new insights into the relationship
between data quality, task alignment, and model learning efficiency.
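For intuition, gzip-based alignment can be sketched with the Python standard library alone. The snippet below is a minimal illustration, not the paper's implementation: it assumes an NCD-style (normalized compression distance) score and a simple top-k selection rule; the helper names (gzip_size, ncd, select_top_k) and the averaging of scores over target samples are illustrative choices, not details taken from ZIP-FIT.

```python
import gzip

def gzip_size(data: bytes) -> int:
    # Length of the gzip-compressed byte string, a proxy for Kolmogorov complexity.
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance:
    #   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    # where C(.) is compressed length. Lower values mean more shared
    # structure, i.e., better alignment between the two texts.
    cx = gzip_size(x.encode("utf-8"))
    cy = gzip_size(y.encode("utf-8"))
    cxy = gzip_size((x + y).encode("utf-8"))
    return (cxy - min(cx, cy)) / max(cx, cy)

def select_top_k(candidates: list[str], target_samples: list[str], k: int) -> list[str]:
    # Hypothetical selection rule: score each candidate by its average
    # alignment (1 - NCD) to a few target-task samples, keep the top k.
    def alignment(c: str) -> float:
        return sum(1.0 - ncd(c, t) for t in target_samples) / len(target_samples)
    return sorted(candidates, key=alignment, reverse=True)[:k]
```

Because the only primitive here is a general-purpose compressor, this style of scoring requires no embeddings or GPU inference, which is consistent with the embedding-free framing and the selection-speed comparisons against DSIR and D4 reported in the abstract.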