Essential-Web v1.0: 24T tokens of organized web data
June 17, 2025
Authors: Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
cs.AI
Abstract
Data plays the most prominent role in how language models acquire skills and
knowledge. The lack of massive, well-organized pre-training datasets results in
costly and inaccessible data pipelines. We present Essential-Web v1.0, a
24-trillion-token dataset in which every document is annotated with a
twelve-category taxonomy covering topic, format, content complexity, and
quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned
0.5b-parameter model that achieves annotator agreement within 3% of
Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain
competitive web-curated datasets in math (-8.0% relative to SOTA), web code
(+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on
HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
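The "SQL-style filters" mentioned above can be sketched as simple predicates over per-document taxonomy labels. A minimal illustration, assuming hypothetical label fields (`topic`, `quality`) that are not necessarily the dataset's actual column names:

```python
# Sketch of SQL-style filtering over taxonomy-labeled documents.
# The field names (topic, quality) and sample records are hypothetical
# placeholders, not Essential-Web v1.0's actual schema.

def make_filter(topic, min_quality):
    """Return a predicate roughly equivalent to:
    SELECT * FROM docs WHERE topic = :topic AND quality >= :min_quality
    """
    def predicate(doc):
        return doc["topic"] == topic and doc["quality"] >= min_quality
    return predicate

# Toy stand-ins for annotated web documents.
docs = [
    {"id": 1, "topic": "math", "quality": 0.9},
    {"id": 2, "topic": "medical", "quality": 0.7},
    {"id": 3, "topic": "math", "quality": 0.3},
]

math_filter = make_filter("math", 0.5)
curated = [d for d in docs if math_filter(d)]
print([d["id"] for d in curated])  # → [1]
```

In practice the same predicate logic would run over the released dataset's taxonomy columns rather than in-memory dicts.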