ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, using history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant ones while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated.
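To make the idea of temporal patch redundancy concrete, the following is a minimal sketch and not the authors' method. It assumes patch features are already available as arrays aligned by grid position, and it replaces the learned patch selector with a fixed cosine-similarity threshold. The function names `keep_mask` and `token_savings` and the threshold value are hypothetical, chosen only for illustration.

```python
import numpy as np

def keep_mask(prev_feats, curr_feats, threshold=0.95):
    """Boolean mask over the current screenshot's patches (True = keep).

    ASSUMPTION: a fixed cosine threshold stands in for the learned selector.
    prev_feats, curr_feats: arrays of shape (num_patches, dim), where row i
    of both arrays refers to the same grid position.
    """
    eps = 1e-8  # avoids division by zero for all-zero patch features
    prev = prev_feats / (np.linalg.norm(prev_feats, axis=1, keepdims=True) + eps)
    curr = curr_feats / (np.linalg.norm(curr_feats, axis=1, keepdims=True) + eps)
    cosine = np.sum(prev * curr, axis=1)
    # A patch that is nearly identical to its predecessor is treated as redundant.
    return cosine < threshold

def token_savings(masks):
    """Fraction of patch tokens dropped across a trajectory of keep-masks."""
    total = sum(m.size for m in masks)
    kept = sum(int(m.sum()) for m in masks)
    return 1.0 - kept / total

# Toy trajectory: two screenshots with 4 patches each; only patch 2 changes.
first = np.eye(4)
second = first.copy()
second[2] = [1.0, 0.0, 0.0, 0.0]

masks = [np.ones(4, dtype=bool), keep_mask(first, second)]
kept_positions = np.flatnonzero(masks[1])  # grid indices of surviving patches
savings = token_savings(masks)
```

Returning a mask over the original grid, rather than a compacted feature list, keeps each surviving patch tied to its position, which is one simple way to retain the spatial structure the abstract mentions. The first screenshot is kept whole here because it has no predecessor to compare against.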