

MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

February 6, 2025
Authors: Xintong Hao, Ke Shen, Chenggang Li
cs.AI

Abstract

Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from existing corpora. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus under different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity of next-generation large-scale synthetic pretraining for language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal limitations in conventional collapse detection metrics based on validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.
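
The abstract describes reformulating existing documents across many genre-audience combinations to expand a pretraining corpus. The minimal Python sketch below illustrates one plausible way such rewrite prompts could be assembled; the genre and audience lists, prompt template, and function names are illustrative assumptions, not the paper's released pipeline.

```python
# Illustrative sketch of genre-audience reformulation prompting.
# Everything here (genre/audience lists, template, function names) is a
# hypothetical example, not the MAGA implementation.

import itertools
import random

GENRES = ["encyclopedia entry", "classroom dialogue", "news explainer"]
AUDIENCES = ["middle-school students", "domain experts", "casual readers"]

PROMPT_TEMPLATE = (
    "Rewrite the following source passage as a {genre} aimed at {audience}, "
    "preserving all factual content.\n\n"
    "Source passage:\n{passage}\n"
)


def build_reformulation_prompts(passage: str, pairs_per_doc: int = 2) -> list[str]:
    """Sample (genre, audience) pairs and render one rewrite prompt per pair."""
    pairs = random.sample(list(itertools.product(GENRES, AUDIENCES)), k=pairs_per_doc)
    return [
        PROMPT_TEMPLATE.format(genre=genre, audience=audience, passage=passage)
        for genre, audience in pairs
    ]


if __name__ == "__main__":
    demo = "Photosynthesis converts light energy into chemical energy stored in glucose."
    for prompt in build_reformulation_prompts(demo):
        print(prompt)
        print("---")
```

In a full pipeline, each rendered prompt would be sent to a generator model and the outputs collected as additional pretraining documents; sampling several pairs per source document is what yields the multiplicative corpus expansion the abstract refers to.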
