PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
October 16, 2025
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
cs.AI
Abstract
In this report, we propose PaddleOCR-VL, a state-of-the-art (SOTA),
resource-efficient model tailored for document parsing. Its core component is
PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that
integrates a NaViT-style dynamic-resolution visual encoder with the
ERNIE-4.5-0.3B language model to enable accurate element recognition. This
innovative model efficiently supports
109 languages and excels in recognizing complex elements (e.g., text, tables,
formulas, and charts), while maintaining minimal resource consumption. Through
comprehensive evaluations on widely used public benchmarks and in-house
benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document
parsing and element-level recognition. It significantly outperforms existing
solutions, exhibits strong competitiveness against top-tier VLMs, and delivers
fast inference speeds. These strengths make it highly suitable for practical
deployment in real-world scenarios.