DataComp-LM: In search of the next generation of training sets for language models

June 17, 2024
Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar
cs.AI

Abstract

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
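
The curation step the abstract singles out is model-based quality filtering. The sketch below illustrates the general idea with a fastText-style classifier that scores web documents and keeps only those above a cutoff; the model file name, label name, and threshold are hypothetical placeholders for illustration, not the paper's exact pipeline.

```python
# Minimal sketch of model-based quality filtering over web documents.
# Assumptions: a binary fastText classifier ("quality_classifier.bin") with a
# "__label__hq" label for high-quality text; the 0.9 threshold is illustrative.
import fasttext

model = fasttext.load_model("quality_classifier.bin")  # hypothetical model file

def quality_score(document: str) -> float:
    """Probability the classifier assigns to the 'high quality' label."""
    # fastText's predict() expects a single line, so newlines are flattened.
    labels, probs = model.predict(document.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

def filter_corpus(documents, threshold=0.9):
    """Keep only documents whose quality score clears the threshold."""
    return [doc for doc in documents if quality_score(doc) >= threshold]

if __name__ == "__main__":
    docs = ["A detailed explanation of gradient descent ...", "click here buy now!!!"]
    print(filter_corpus(docs))
```

In the full benchmark, a filter of this kind would be only one stage of a larger curation pipeline alongside the deduplication and data-mixing strategies mentioned above.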
