

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

January 29, 2026
Authors: Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang
cs.AI

Abstract

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To address this, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
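The core idea of adaptive information density can be sketched in text form: give high-priority memory entries (headings, highlighted evidence) a larger share of a fixed budget and aggressively truncate auxiliary details. The following minimal Python sketch illustrates that allocation policy only; the function name, priority weights, and character-level budget are illustrative assumptions, not the paper's actual visual-rendering pipeline.

```python
def compress_memory(entries, budget):
    """Allocate a fixed character budget across memory entries.

    entries: list of (priority, text) pairs, higher priority = keep more.
    budget:  total character budget for the compressed snapshot.
    Each entry receives a share of the budget proportional to its
    priority, so crucial evidence survives while details are cut.
    """
    total_priority = sum(p for p, _ in entries)
    compressed = []
    for priority, text in entries:
        # Proportional share, with at least one character per entry.
        share = max(1, budget * priority // total_priority)
        compressed.append(text[:share])
    return compressed

# Illustrative memory state: one key finding, two low-value traces.
memory = [
    (5, "KEY EVIDENCE: the answer entity appears in document 3"),
    (1, "auxiliary detail: intermediate page navigation steps ..."),
    (1, "auxiliary detail: tool call logs and retries ..."),
]
snapshot = compress_memory(memory, budget=60)
```

Under a tighter budget the same policy still preserves the high-priority entry first, which mirrors the paper's goal of robustness across compression levels; MemOCR realizes this visually (font size, highlighting in a rendered image) rather than by character truncation.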