ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

Abstract

Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to effectively consider the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that fail to capture the nuanced patterns needed for tasks like Autoformalization or code generation. Methods that do consider the target distribution often rely on simplistic, sometimes noisy, representations, like hashed n-gram features, which can lead to collisions and introduce noise. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution. In extensive evaluations on Autoformalization and Python code generation, ZIP-FIT significantly outperforms leading baselines like DSIR and D4. Models trained on ZIP-FIT-selected data achieve their lowest cross-entropy loss up to 85.1% faster than baselines, demonstrating that better task alignment leads to more efficient learning. In addition, ZIP-FIT performs selection up to 65.8% faster than DSIR and two orders of magnitude faster than D4. Notably, ZIP-FIT shows that smaller, well-aligned datasets often outperform larger but less targeted ones, demonstrating that a small amount of higher quality data is superior to a large amount of lower quality data. Our results imply that task-aware data selection is crucial for efficient domain adaptation, and that compression offers a principled way to measure task alignment. By showing that targeted data selection can dramatically improve task-specific performance, our work provides new insights into the relationship between data quality, task alignment, and model learning efficiency.
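
To make the idea of compression-based alignment concrete, here is a minimal Python sketch. The abstract does not state the exact scoring rule, so the sketch assumes one common choice, the Normalized Compression Distance (NCD) computed with gzip, and ranks candidates by their mean (1 - NCD) against a handful of target examples. The function names (compressed_size, ncd, alignment, select_top_k) are hypothetical and used for illustration only; this is not the authors' implementation.

```python
import gzip


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance (assumed metric).

    Lower values mean the two strings share more compressible structure.
    """
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def alignment(candidate: str, targets: list[str]) -> float:
    """Mean (1 - NCD) between one candidate and the target examples."""
    return sum(1.0 - ncd(candidate, t) for t in targets) / len(targets)


def select_top_k(candidates: list[str], targets: list[str], k: int) -> list[str]:
    """Keep the k candidates with the highest alignment score."""
    ranked = sorted(candidates, key=lambda c: alignment(c, targets), reverse=True)
    return ranked[:k]
```

Under these assumptions, a candidate that resembles the target examples compresses well when concatenated with them and therefore scores higher than unrelated text. No embeddings or model forward passes are involved, which is consistent with the selection-speed claims in the abstract.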
