PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

October 16, 2025
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
cs.AI

Abstract

In this report, we propose PaddleOCR-VL, a state-of-the-art (SOTA), resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model efficiently supports 109 languages and excels at recognizing complex elements (e.g., text, tables, formulas, and charts) while keeping resource consumption minimal. In comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference. These strengths make it well suited to practical deployment in real-world scenarios.