OCR-Agent: Agentic OCR with Capability and Memory Reflection

February 24, 2026
Authors: Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai
cs.AI

Abstract

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, which makes it difficult for them to rectify their own cognitive biases. As a result, during multi-turn revision they often fall into repetitive, ineffective attempts and fail to improve answer quality steadily. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. The framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then use Memory Reflection to review past attempts so that it avoids repetition and explores new solutions, and finally refine the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance the reasoning robustness of VLMs without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
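The three-stage loop described in the abstract (Capability Reflection, then Memory Reflection, then re-reasoning) can be pictured with a minimal text-only sketch. This is not the authors' implementation: the `self_correct` function, the `Attempt` record, the bracketed prompt tags, the "OK" stopping convention, and the `toy_model` stub are all hypothetical names and choices introduced here for illustration, and a real system would pass the image to a VLM rather than a plain string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One discarded answer together with the plan that criticised it."""
    answer: str
    plan: str


def self_correct(
    question: str,
    model: Callable[[str], str],  # hypothetical: any prompt -> text function
    max_rounds: int = 3,
) -> Tuple[str, List[Attempt]]:
    """Illustrative reflect-and-retry loop; prompts are assumptions."""
    memory: List[Attempt] = []
    answer = model(f"[ANSWER] Question: {question}")
    for _ in range(max_rounds):
        # Capability Reflection: diagnose the current answer and draft a plan.
        plan = model(
            f"[CAPABILITY] Question: {question}\nCurrent answer: {answer}\n"
            "Reply OK if correct, otherwise give a correction plan."
        )
        if plan.strip() == "OK":  # assumed stopping convention
            break
        # Memory Reflection: show past attempts so the model avoids repeats.
        history = "\n".join(f"- {a.answer} ({a.plan})" for a in memory)
        strategy = model(
            f"[MEMORY] Past attempts:\n{history}\nPlan: {plan}\n"
            "Propose an approach not tried before."
        )
        # Re-reasoning: produce a revised answer from the plan and strategy.
        new_answer = model(
            f"[REASON] Question: {question}\nPlan: {plan}\nStrategy: {strategy}"
        )
        memory.append(Attempt(answer=answer, plan=plan))
        answer = new_answer
    return answer, memory


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for a VLM, used only to exercise the loop."""
    if prompt.startswith("[CAPABILITY]"):
        if "Current answer: HELLO" in prompt:
            return "OK"
        return "Plan: re-read the letter casing"
    if prompt.startswith("[MEMORY]"):
        return "Try an uppercase reading"
    if prompt.startswith("[REASON]"):
        return "HELLO"
    return "hello"


final_answer, attempts = self_correct("What word is on the sign?", toy_model)
```

With the stub above, the first answer is rejected once, revised, and then accepted, so `attempts` holds a single discarded entry. Because the loop only issues prompts to an existing model, the sketch is consistent with the abstract's claim that no additional training is needed.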