MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Abstract

We introduce the Multilingual Document Parsing Benchmark (MDPBench), the first benchmark for multilingual parsing of both digital and photographed documents. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer a dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
