Essential-Web v1.0: 24T tokens of organized web data
June 17, 2025
Authors: Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
cs.AI
Abstract
Data plays the most prominent role in how language models acquire skills and
knowledge. The lack of massive, well-organized pre-training datasets results in
costly and inaccessible data pipelines. We present Essential-Web v1.0, a
24-trillion-token dataset in which every document is annotated with a
twelve-category taxonomy covering topic, format, content complexity, and
quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned
0.5b-parameter model whose annotator agreement is within 3% of
Qwen2.5-32B-Instruct's. With nothing more than SQL-style filters, we obtain
competitive web-curated datasets in math (-8.0% relative to SOTA), web code
(+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on
HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
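The curation workflow the abstract describes — selecting domain subsets purely with SQL-style predicates over per-document taxonomy labels — can be sketched as below. This is a minimal illustration on toy data, not the paper's pipeline: the column names (`subject`, `doc_type`, `reasoning_depth`, `quality`) are hypothetical stand-ins, and the actual Essential-Web v1.0 schema on HuggingFace may differ.

```python
import pandas as pd

# Toy stand-in for taxonomy-annotated web documents. In Essential-Web v1.0,
# labels like these would come from the EAI-Distill-0.5b classifier; the
# field names and value ranges here are assumptions for illustration only.
docs = pd.DataFrame([
    {"id": 1, "subject": "math",     "doc_type": "tutorial",  "reasoning_depth": 4, "quality": 0.90},
    {"id": 2, "subject": "history",  "doc_type": "news",      "reasoning_depth": 1, "quality": 0.70},
    {"id": 3, "subject": "medicine", "doc_type": "reference", "reasoning_depth": 3, "quality": 0.95},
    {"id": 4, "subject": "math",     "doc_type": "forum",     "reasoning_depth": 2, "quality": 0.40},
])

# A SQL-style filter: keep high-quality math documents that require
# substantial reasoning. The same predicate could run in DuckDB or Spark SQL.
math_subset = docs.query("subject == 'math' and reasoning_depth >= 3 and quality > 0.5")
print(math_subset["id"].tolist())  # -> [1]
```

The point of the design is that once every document carries taxonomy labels, building a domain-specific pre-training corpus reduces to writing a predicate like the one above, rather than running a bespoke extraction pipeline per domain.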