Essential-Web v1.0: 24T tokens of organized web data

Abstract

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%), and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
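
To make "SQL-style filters" concrete, the following is a minimal sketch of selecting documents by taxonomy label. The table layout, column names (topic, format, complexity, quality), label values, and the helper filter_documents are all hypothetical and chosen for illustration; they are not the dataset's published schema, and the rows are toy data.

```python
import sqlite3

# Toy rows: (id, topic, format, complexity, quality).
# All column names and label values are illustrative assumptions,
# not the field names used by Essential-Web v1.0.
ROWS = [
    (1, "mathematics", "tutorial", "advanced", "high"),
    (2, "mathematics", "forum", "basic", "low"),
    (3, "medicine", "article", "advanced", "high"),
    (4, "software", "code", "intermediate", "high"),
]


def filter_documents(rows, topic, quality):
    """Return ids of documents whose labels match the given topic and quality."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER, topic TEXT, format TEXT, complexity TEXT, quality TEXT)"
    )
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?)", rows)
    # The filter itself is a plain WHERE clause over label columns.
    cursor = conn.execute(
        "SELECT id FROM docs WHERE topic = ? AND quality = ? ORDER BY id",
        (topic, quality),
    )
    ids = [row[0] for row in cursor.fetchall()]
    conn.close()
    return ids


if __name__ == "__main__":
    print(filter_documents(ROWS, "mathematics", "high"))
```

The point of the sketch is only that, once every document carries labels, building a domain subset reduces to a declarative query rather than a custom classification pipeline.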
