MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Abstract

We introduce the Multilingual Document Parsing Benchmark (MDPBench), the first benchmark for parsing multilingual digital and photographed documents. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer a dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source code is available at https://github.com/Yuliang-Liu/MultimodalOCR.
