Essential-Web v1.0: 24T tokens of organized web data
June 17, 2025
Authors: Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
cs.AI
Abstract
Data plays the most prominent role in how language models acquire skills and
knowledge. The lack of massive, well-organized pre-training datasets results in
costly and inaccessible data pipelines. We present Essential-Web v1.0, a
24-trillion-token dataset in which every document is annotated with a
twelve-category taxonomy covering topic, format, content complexity, and
quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned
0.5b-parameter model that achieves annotator agreement within 3% of
Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain
competitive web-curated datasets in math (-8.0% relative to SOTA), web code
(+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on
HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
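The "SQL-style filters" mentioned above can be sketched as simple predicates over per-document taxonomy labels. A minimal illustration, assuming hypothetical label fields (`topic`, `quality`) that are not necessarily the dataset's actual column names:

```python
# Sketch of SQL-style filtering over taxonomy-labeled documents.
# The field names (topic, quality) and sample records are hypothetical
# placeholders, not Essential-Web v1.0's actual schema.

def make_filter(topic, min_quality):
    """Return a predicate roughly equivalent to:
    SELECT * FROM docs WHERE topic = :topic AND quality >= :min_quality
    """
    def predicate(doc):
        return doc["topic"] == topic and doc["quality"] >= min_quality
    return predicate

# Toy stand-ins for annotated web documents.
docs = [
    {"id": 1, "topic": "math", "quality": 0.9},
    {"id": 2, "topic": "medical", "quality": 0.7},
    {"id": 3, "topic": "math", "quality": 0.3},
]

math_filter = make_filter("math", 0.5)
curated = [d for d in docs if math_filter(d)]
print([d["id"] for d in curated])  # → [1]
```

In practice the same predicate logic would run over the released dataset's taxonomy columns rather than in-memory dicts.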