AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
November 20, 2025
Authors: Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He
cs.AI
Abstract
While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured-element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08 percentage points, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
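
To make the sequence-labeling formulation concrete, below is a minimal Python sketch of the two stages the abstract describes: labeling DOM blocks as main content or boilerplate, then rendering the retained semantic elements to Markdown. The block schema, label set, and rule-based classifier are illustrative stand-ins, not MinerU-HTML's actual interface; in the paper, the labeling step is performed by a 0.6B-parameter language model.

```python
# Illustrative two-stage sketch; the Block schema and labels are hypothetical,
# not MinerU-HTML's actual API.

from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    tag: str   # source DOM tag, e.g. "p", "h1", "pre", "nav"
    text: str  # text content of the block

def label_blocks(blocks: List[Block]) -> List[str]:
    """Stage 1: sequence labeling over DOM blocks.
    The paper assigns these labels with a 0.6B-parameter language model;
    a trivial tag-based rule stands in here so the sketch runs end to end."""
    return ["BOILERPLATE" if b.tag in {"nav", "aside", "footer"} else "MAIN"
            for b in blocks]

def to_markdown(blocks: List[Block], labels: List[str]) -> str:
    """Stage 2: keep MAIN blocks and render each semantic category to Markdown."""
    out = []
    for block, label in zip(blocks, labels):
        if label != "MAIN":
            continue
        if block.tag == "pre":
            # Code block: emit as 4-space-indented Markdown code.
            out.append("\n".join("    " + ln for ln in block.text.splitlines()))
        elif block.tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            out.append("#" * int(block.tag[1]) + " " + block.text)
        else:
            out.append(block.text)
    return "\n\n".join(out)

if __name__ == "__main__":
    page = [
        Block("nav", "Home | About | Login"),
        Block("h1", "An Example Article"),
        Block("p", "Body paragraph kept as main content."),
        Block("pre", "print('hello')"),
        Block("footer", "Copyright 2025"),
    ]
    print(to_markdown(page, label_blocks(page)))
```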
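
The extraction scores reported above are ROUGE-N F1 between extracted and reference main-content text. For reference, here is a self-contained implementation assuming naive whitespace tokenization; the paper's exact preprocessing is not specified in the abstract.

```python
# Hedged sketch of the ROUGE-N F1 metric used to compare extractors
# (e.g. 81.8% for MinerU-HTML vs. 63.6% for Trafilatura on MainWebBench).

from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    def ngrams(text: str) -> Counter:
        toks = text.split()  # assumption: whitespace tokenization
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("the main content text", "the main article text"))  # 0.75
```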