ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost rises rapidly, limiting how much history can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, using history has yielded little or no performance improvement for CUAs. We address this inefficiency with ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant ones while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated.
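To make the idea of temporal patch redundancy concrete, the following is a minimal sketch and not the ReVision implementation. It assumes each screenshot is already a grid of patch feature vectors, and it swaps the learned patch selector for a fixed cosine-similarity threshold. The names `cosine_similarity` and `prune_trajectory`, the threshold value, and the choice to keep the first screenshot whole are all illustrative assumptions. Kept patches carry their (row, column) index, which is one simple way to retain spatial position after dropping.

```python
import math
from typing import List, Tuple

Patch = List[float]
Frame = List[List[Patch]]  # frame[row][col] -> patch feature vector


def cosine_similarity(a: Patch, b: Patch) -> float:
    """Cosine similarity between two patch feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Two all-zero patches count as identical; zero vs non-zero as different.
        return 1.0 if a == b else 0.0
    return dot / norm


def prune_trajectory(
    frames: List[Frame], threshold: float = 0.99
) -> List[List[Tuple[int, int, Patch]]]:
    """Drop patches that are near-identical to the same-position patch in the
    previous frame. Assumption: a fixed threshold stands in for the learned
    selector, and the first frame is kept in full."""
    pruned = []
    for t, frame in enumerate(frames):
        kept = []
        for r, row in enumerate(frame):
            for c, patch in enumerate(row):
                changed = (
                    t == 0
                    or cosine_similarity(frames[t - 1][r][c], patch) < threshold
                )
                if changed:
                    # Keep the grid index so spatial position survives pruning.
                    kept.append((r, c, patch))
        pruned.append(kept)
    return pruned


# Toy trajectory: three 2x2 frames; only patch (1, 1) changes, at step 1.
frame0 = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 1.0]]]
frame1 = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 2.0]]]
frame2 = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 2.0]]]

example = prune_trajectory([frame0, frame1, frame2])
kept_counts = [len(kept) for kept in example]
print(kept_counts)  # -> [4, 1, 0]
```

In this toy case 5 of the 12 patches survive. The figure only illustrates the mechanism and says nothing about the reductions reported above, which come from a learned selector on real screenshots.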