

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

March 2, 2025
Authors: Kashun Shum, Yuzhen Huang, Hongjian Zou, Ding Qi, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
cs.AI

Abstract

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and to select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that the compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data's Predictive strength (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, at the scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
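The abstract describes the selection pipeline only at a high level: estimate each document's "predictive strength" from how well model losses on that document track downstream performance, then distill that signal into a lightweight fastText scorer used to filter the corpus. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation (the real code is in the linked repository); the function names, the correlation-based strength estimate, and the keep/drop labeling scheme are all hypothetical choices made for illustration.

```python
# Minimal sketch of a PreSelect-style pipeline. All names here
# (predictive_strength, label_documents, __label__keep/__label__drop)
# are illustrative assumptions, not the authors' actual code.

import numpy as np
import fasttext  # pip install fasttext


def predictive_strength(doc_losses, benchmark_scores):
    """Estimate how predictive a document is of downstream ability.

    doc_losses: shape (n_models,), normalized loss of each of several
        diverse, pre-existing models on this document.
    benchmark_scores: shape (n_models,), downstream accuracy of the
        same models. Lower loss should track higher accuracy, so we
        negate the correlation; larger values = more predictive data.
    """
    return -np.corrcoef(doc_losses, benchmark_scores)[0, 1]


def label_documents(documents, losses, benchmark_scores, keep_fraction=0.1):
    """Rank seed documents by predictive strength and emit fastText
    training lines: the top fraction becomes __label__keep, the rest
    __label__drop. `losses` has shape (n_docs, n_models)."""
    strengths = np.array([
        predictive_strength(losses[i], benchmark_scores)
        for i in range(len(documents))
    ])
    cutoff = np.quantile(strengths, 1.0 - keep_fraction)
    lines = []
    for doc, s in zip(documents, strengths):
        label = "__label__keep" if s >= cutoff else "__label__drop"
        # fastText expects one whitespace-joined example per line.
        lines.append(f"{label} {' '.join(doc.split())}")
    return lines


def train_scorer(labeled_lines, path="preselect_seed.txt"):
    """Train the lightweight fastText classifier on the labeled seed set."""
    with open(path, "w") as f:
        f.write("\n".join(labeled_lines))
    return fasttext.train_supervised(input=path)


def keep_probability(model, text):
    """Score an arbitrary document from the full corpus; documents with
    high keep probability would be retained for pretraining."""
    labels, probs = model.predict(text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__keep" else 1.0 - probs[0]
```

Once the scorer is trained on the small seed set, it can be applied to the full web-scale corpus cheaply, which is what makes the approach lightweight compared with scoring every document with large reference models.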

