MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
March 30, 2026
Authors: Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song, Jiarui Zhang, Xiang Bai, Yuliang Liu
cs.AI
Abstract
We introduce the Multilingual Document Parsing Benchmark (MDPBench), the first benchmark for parsing multilingual digital and photographed documents. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source code is available at https://github.com/Yuliang-Liu/MultimodalOCR.