Essential-Web v1.0: 24T tokens of organized web data
June 17, 2025
Authors: Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
cs.AI
Abstract
Data plays the most prominent role in how language models acquire skills and
knowledge. The lack of massive, well-organized pre-training datasets results in
costly and inaccessible data pipelines. We present Essential-Web v1.0, a
24-trillion-token dataset in which every document is annotated with a
twelve-category taxonomy covering topic, format, content complexity, and
quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned
0.5b-parameter model whose annotator agreement is within 3% of
Qwen2.5-32B-Instruct's. With nothing more than SQL-style filters, we obtain
competitive web-curated datasets in math (-8.0% relative to SOTA), web code
(+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on
HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
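The curation workflow the abstract describes — selecting domain subsets purely with SQL-style predicates over per-document taxonomy labels — can be sketched as below. This is a minimal illustration on toy data, not the paper's pipeline: the column names (`subject`, `doc_type`, `reasoning_depth`, `quality`) are hypothetical stand-ins, and the actual Essential-Web v1.0 schema on HuggingFace may differ.

```python
import pandas as pd

# Toy stand-in for taxonomy-annotated web documents. In Essential-Web v1.0,
# labels like these would come from the EAI-Distill-0.5b classifier; the
# field names and value ranges here are assumptions for illustration only.
docs = pd.DataFrame([
    {"id": 1, "subject": "math",     "doc_type": "tutorial",  "reasoning_depth": 4, "quality": 0.90},
    {"id": 2, "subject": "history",  "doc_type": "news",      "reasoning_depth": 1, "quality": 0.70},
    {"id": 3, "subject": "medicine", "doc_type": "reference", "reasoning_depth": 3, "quality": 0.95},
    {"id": 4, "subject": "math",     "doc_type": "forum",     "reasoning_depth": 2, "quality": 0.40},
])

# A SQL-style filter: keep high-quality math documents that require
# substantial reasoning. The same predicate could run in DuckDB or Spark SQL.
math_subset = docs.query("subject == 'math' and reasoning_depth >= 3 and quality > 0.5")
print(math_subset["id"].tolist())  # -> [1]
```

The point of the design is that once every document carries taxonomy labels, building a domain-specific pre-training corpus reduces to writing a predicate like the one above, rather than running a bespoke extraction pipeline per domain.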