AgentOCR: Reimagining Agent History via Optical Self-Compression

Abstract

Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the higher information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching: it decomposes the history into hashable segments and maintains a visual cache, which eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, in which the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on two challenging agentic benchmarks, ALFWorld and search-based QA. The results show that AgentOCR preserves over 95% of text-based agent performance while reducing token consumption by more than 50%, yielding consistent token and memory efficiency. Further analysis validates a 20x rendering speedup from segment optical caching and shows that self-compression strikes an effective balance between task success and token efficiency.
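
The segment optical caching idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name SegmentRenderCache, the SHA-256 key, and the fake_render stand-in (which returns bytes instead of drawing an image) are all assumptions made for illustration. The sketch only shows the general pattern of hashing each history segment and rendering it once.

```python
import hashlib
from typing import Callable, Dict, List


class SegmentRenderCache:
    """Caches rendered output per history segment, keyed by a content hash.

    Hypothetical helper for illustration; not part of AgentOCR's code.
    """

    def __init__(self, render_fn: Callable[[str], bytes]) -> None:
        self._render_fn = render_fn        # assumed renderer: text -> image bytes
        self._cache: Dict[str, bytes] = {}
        self.render_calls = 0              # counts real (non-cached) renders

    @staticmethod
    def _key(segment: str) -> str:
        # Identical segment text always maps to the same cache key.
        return hashlib.sha256(segment.encode("utf-8")).hexdigest()

    def render_history(self, segments: List[str]) -> List[bytes]:
        tiles = []
        for seg in segments:
            key = self._key(seg)
            if key not in self._cache:
                # Cache miss: render this segment once and remember it.
                self._cache[key] = self._render_fn(seg)
                self.render_calls += 1
            tiles.append(self._cache[key])
        return tiles


def fake_render(segment: str) -> bytes:
    # Stand-in for a real text-to-image renderer (assumption).
    return segment.encode("utf-8")


cache = SegmentRenderCache(fake_render)
history = ["obs: you are in a kitchen", "act: open fridge"]
cache.render_history(history)              # turn 1: two segments rendered
history.append("obs: the fridge is open")
cache.render_history(history)              # turn 2: only the new segment rendered
print(cache.render_calls)                  # 3
```

Without the cache, the two turns would trigger five renders (two, then three). With it, each turn renders only the segments that have not been seen before.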
