OCR-Agent: Agentic OCR with Capability and Memory Reflection

Abstract

Large Vision-Language Models (VLMs) have shown significant potential on complex visual understanding tasks when combined with iterative optimization methods. However, these models generally lack effective self-correction mechanisms and struggle to rectify their own cognitive biases. As a result, during multi-turn revision they often fall into repetitive, ineffective attempts and fail to improve answer quality in a stable way. To address this, we propose a novel iterative self-correction framework that gives the model two key capabilities: Capability Reflection and Memory Reflection. The framework guides the model to first diagnose errors and draw up a correction plan via Capability Reflection, then use Memory Reflection to review past attempts so that it avoids repetition and explores new solutions, and finally refine the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance the reasoning robustness of VLMs without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
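The three-stage loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable, the prompt tags, the `Attempt` record, the `max_rounds` limit, and the "reply OK to stop" convention are all assumptions introduced here. A real system would also pass the image to the VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any callable that maps a text prompt to a text reply.
Model = Callable[[str], str]


@dataclass
class Attempt:
    """One past revision round: the answer that was judged and the plan tried."""
    answer: str
    plan: str


def self_correct(model: Model, question: str,
                 max_rounds: int = 3) -> Tuple[str, List[Attempt]]:
    """Iteratively refine an answer using capability and memory reflection."""
    answer = model(f"Question: {question}\nAnswer:")
    memory: List[Attempt] = []

    for _ in range(max_rounds):
        # Capability reflection: diagnose errors and draft a correction plan.
        # Assumption: the model replies exactly "OK" when it finds no error.
        diagnosis = model(
            f"[capability] Question: {question}\n"
            f"Current answer: {answer}\n"
            "Diagnose any errors and propose a fix, or reply OK."
        )
        if diagnosis.strip() == "OK":
            break

        # Memory reflection: review past attempts to avoid repeating them.
        history = "\n".join(
            f"- answer: {a.answer}; plan: {a.plan}" for a in memory
        ) or "(none)"
        plan = model(
            f"[memory] Past attempts:\n{history}\n"
            f"Draft plan: {diagnosis}\n"
            "Revise the plan so it differs from past attempts."
        )
        memory.append(Attempt(answer=answer, plan=plan))

        # Re-reasoning: produce a new answer that follows the revised plan.
        answer = model(
            f"[re-reason] Question: {question}\n"
            f"Plan: {plan}\nAnswer:"
        )

    return answer, memory
```

Because the loop only issues prompts, it needs no additional training, which matches the training-free claim in the abstract. Any stopping rule other than the assumed "OK" reply (for example, a fixed number of rounds) would slot into the same structure.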
