ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting how much history can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, adding history has yielded little or no performance improvement for CUAs. We address this inefficiency with ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant patches while preserving the spatial structure the model requires. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This is a clear efficiency gain: agents can process longer trajectories with fewer tokens. Building on this efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated.
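The idea of dropping patches that repeat across consecutive screenshots can be illustrated with a minimal sketch. This is not the ReVision selector: the abstract describes a learned patch selector, whereas the code below substitutes a fixed cosine-similarity threshold purely for illustration. The function names (`cosine`, `keep_mask`, `prune_trajectory`), the threshold value, the toy 2-dimensional patch vectors, and the choice to keep the first screenshot whole are all assumptions. Each kept patch retains its original raster index, which is one simple way to preserve its spatial position.

```python
import math


def cosine(a, b):
    """Cosine similarity between two patch vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero patches count as identical; otherwise as different.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def keep_mask(prev_patches, curr_patches, threshold=0.99):
    """True where a patch differs enough from the same position in the
    previous screenshot. The fixed threshold is an assumed stand-in for
    a learned selector."""
    return [cosine(p, c) < threshold for p, c in zip(prev_patches, curr_patches)]


def prune_trajectory(frames, threshold=0.99):
    """frames: list of screenshots, each a list of patch vectors in raster order.

    Returns, per screenshot, a list of (patch_index, vector) pairs.
    The first screenshot is kept whole (an assumption of this sketch);
    later ones keep only patches that changed since the previous screenshot.
    """
    pruned = [list(enumerate(frames[0]))]
    for prev, curr in zip(frames, frames[1:]):
        mask = keep_mask(prev, curr, threshold)
        pruned.append([(i, v) for i, (v, keep) in enumerate(zip(curr, mask)) if keep])
    return pruned


# Toy trajectory: 3 screenshots with 4 patches each.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]],  # only patch 3 changed
    [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],  # only patch 1 changed
]
pruned = prune_trajectory(frames)
kept = sum(len(frame) for frame in pruned)
total = sum(len(frame) for frame in frames)
print(f"{kept}/{total} patches kept")
```

In this toy case only the changed patches of the second and third screenshots survive, so the saving grows with the number of near-identical history screenshots. The numbers here are illustrative only and say nothing about the token reductions reported above.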