Step-level Optimization for Efficient Computer-use Agents

Abstract

Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. Across computer-use benchmarks, these failures repeatedly take two forms: progress stalls, where the agent loops, repeats ineffective actions, or fails to make meaningful progress; and silent semantic drift, where the agent keeps taking locally plausible actions after it has already deviated from the user's true goal. To address this inefficiency, we propose an event-driven, step-level cascade for computer-use agents that runs a small policy by default and escalates to a stronger model only when lightweight learned monitors detect elevated risk. Our framework combines two complementary signals: a Stuck Monitor that detects degraded progress from recent reasoning-action history and triggers recovery, and a Milestone Monitor that identifies semantically meaningful checkpoints where sparse verification is most informative for catching drift. This design turns always-on frontier-model inference into adaptive, on-demand compute allocation over the course of an evolving interaction. The framework is modular and deployment-oriented: it can be layered on top of existing computer-use agents without changing the underlying agent architecture or retraining the large model.
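The control flow of such a cascade can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration only: `StepCascade`, `stuck_score`, the thresholds, and the policy signatures are not taken from the paper. In particular, the repetition heuristic is only a toy stand-in for the learned Stuck Monitor, the `is_milestone` callback stands in for the learned Milestone Monitor, and "verification" is simplified to letting the stronger model choose the action at a checkpoint.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

Action = str
Observation = str
# A policy maps the current observation and the action history to an action.
Policy = Callable[[Observation, List[Action]], Action]


def stuck_score(recent: List[Action]) -> float:
    """Toy stand-in for a learned Stuck Monitor (assumption, not the paper's model).

    Returns the fraction of repeated actions in the recent window:
    0.0 means all actions are distinct, values near 1.0 mean heavy repetition.
    """
    if len(recent) < 2:
        return 0.0
    return 1.0 - len(set(recent)) / len(recent)


@dataclass
class StepCascade:
    """Runs the small policy by default and escalates only on monitor events."""

    small_policy: Policy
    large_policy: Policy
    is_milestone: Callable[[Observation, Action], bool]  # hypothetical monitor
    stuck_threshold: float = 0.5  # illustrative value
    window: int = 4               # illustrative value
    history: Deque[Action] = field(default_factory=deque)
    large_calls: int = 0

    def _escalate(self, obs: Observation) -> Action:
        # Count every call to the expensive model so its usage can be inspected.
        self.large_calls += 1
        return self.large_policy(obs, list(self.history))

    def step(self, obs: Observation) -> Action:
        recent = list(self.history)[-self.window:]
        if stuck_score(recent) >= self.stuck_threshold:
            # Stuck event: hand this step to the stronger model for recovery.
            action = self._escalate(obs)
        else:
            action = self.small_policy(obs, list(self.history))
            if self.is_milestone(obs, action):
                # Milestone event: sparse check by the stronger model,
                # simplified here to overriding the small policy's proposal.
                action = self._escalate(obs)
        self.history.append(action)
        return action
```

In this sketch the expensive model is called only when one of the two monitors fires, so `large_calls` grows with the number of risk events rather than with the number of steps. A real system would replace both monitors with trained classifiers over the reasoning-action history.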
