MinerU: An Open-Source Solution for Precise Document Content Extraction

September 27, 2024
Authors: Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He
cs.AI

Abstract

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
