AgentOCR: Reimagining Agent History via Optical Self-Compression
January 8, 2026
Authors: Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, Bo An
cs.AI
Abstract
Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching: by decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
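The segment optical caching idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the renderer is a hypothetical stand-in for the actual text-to-image pipeline, and segment keys are content hashes so that unchanged history segments are never re-rendered across turns.

```python
import hashlib

class SegmentOpticalCache:
    """Cache rendered images of observation-action history segments,
    keyed by a hash of the segment text (a sketch of AgentOCR's
    segment optical caching; details here are assumptions)."""

    def __init__(self, render_segment):
        # render_segment: hypothetical callable mapping text -> image
        self.render_segment = render_segment
        self.cache = {}   # segment hash -> rendered image
        self.hits = 0
        self.misses = 0

    def render_history(self, segments):
        """Return rendered images for all segments, re-rendering
        only those not already in the cache."""
        images = []
        for seg in segments:
            key = hashlib.sha256(seg.encode("utf-8")).hexdigest()
            if key in self.cache:
                self.hits += 1
            else:
                self.misses += 1
                self.cache[key] = self.render_segment(seg)
            images.append(self.cache[key])
        # Downstream, these per-segment images would be composited
        # into one compact history image for the vision encoder.
        return images

# Placeholder renderer: a tuple standing in for rendered pixels.
cache = SegmentOpticalCache(lambda s: ("img", s))
cache.render_history(["obs1 act1", "obs2 act2"])
# Next turn appends one segment; the first two are cache hits.
cache.render_history(["obs1 act1", "obs2 act2", "obs3 act3"])
```

In a multi-turn rollout only the newest segment is rendered each turn, which is the source of the rendering speedup the abstract reports.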