AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

November 20, 2025
Authors: Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He
cs.AI

Abstract

While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed preprocessing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting them to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured-element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08 percentage points, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
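The core reformulation described above, treating main-content extraction as sequence labeling over a page's blocks, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the regex-based block segmentation, the binary label set, and the `model.classify` call are hypothetical stand-ins, not the released MinerU-HTML pipeline or its 0.6B model.

```python
import re
from dataclasses import dataclass

# Hypothetical binary label set; the real pipeline categorizes
# finer-grained semantic elements before Markdown conversion.
LABELS = ["main", "boilerplate"]

BLOCK_TAGS = ("p", "pre", "table", "ul", "ol", "nav", "header", "footer")

@dataclass
class Block:
    tag: str   # e.g. "pre", "table", "nav"
    html: str  # raw HTML of the block

def segment(html: str) -> list[Block]:
    """Crude, non-nesting block segmentation; a real implementation
    would walk the DOM instead of using a regex."""
    pattern = re.compile(
        r"<({})\b[^>]*>.*?</\1>".format("|".join(BLOCK_TAGS)),
        re.DOTALL | re.IGNORECASE,
    )
    return [Block(m.group(1).lower(), m.group(0)) for m in pattern.finditer(html)]

def extract_main(html: str, model) -> list[Block]:
    """Label all blocks in one pass, so each decision is conditioned on
    the surrounding blocks (the sequence-labeling view), then keep only
    the blocks labeled as main content."""
    blocks = segment(html)
    prompt = "\n".join(f"[{i}] <{b.tag}> {b.html}" for i, b in enumerate(blocks))
    labels = model.classify(prompt, labels=LABELS)  # hypothetical model API
    return [b for b, lab in zip(blocks, labels) if lab == "main"]
```

Keeping whole blocks, rather than stripping straight to plain text, is what lets a downstream formatting stage preserve code fences, formulas, and tables when converting to Markdown.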
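The headline comparison (81.8% vs. 63.6%) uses ROUGE-N F1 between the extracted text and a gold annotation. For concreteness, here is a self-contained sketch of that metric, assuming simple whitespace tokenization rather than the paper's exact evaluation setup:

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N F1: harmonic mean of n-gram precision and recall
    between extracted (candidate) and annotated (reference) text."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the score is computed over the extracted text as a whole, an extractor that drops or garbles a code block or table loses recall on every n-gram inside it, which is why structure-destroying extraction shows up directly in this metric.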