PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Abstract

In this report, we propose PaddleOCR-VL, a state-of-the-art (SOTA), resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model supports 109 languages and excels at recognizing complex elements (e.g., text, tables, formulas, and charts) while keeping resource consumption minimal. In comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference. These strengths make it well suited to practical deployment in real-world scenarios.