PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

Abstract

Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction before the agent executes a task. However, an intelligent AI assistant should be proactive: it should anticipate user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and offer timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories fraught with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories with multiple interleaved intents and noisy segments with various user profile contexts, challenging agents to detect actionable events while adapting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that enables general-purpose MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust and proactive GUI-based personal assistants.
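To make the setting concrete, the following is a minimal illustrative sketch of per-thread state tracking over an interleaved, noisy trajectory. It is not the PIRF baseline, whose design the abstract does not specify. All names here (`Frame`, `ThreadMemory`, `IntentTracker`, `min_evidence`) are hypothetical, and the `task` field stands in for a judgement that an MLLM would have to infer from the raw screenshot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    """One screenshot, reduced to a text description for illustration."""
    description: str
    # Hypothetical label standing in for an MLLM's judgement of which task
    # the frame belongs to; None marks a noisy or meaningless frame.
    task: Optional[str] = None


@dataclass
class ThreadMemory:
    """Evidence accumulated so far for one task thread."""
    evidence: List[str] = field(default_factory=list)
    recommended: bool = False


class IntentTracker:
    """Toy tracker: one memory slot per task thread; noise is ignored."""

    def __init__(self, min_evidence: int = 2) -> None:
        # Assumed threshold: frames needed before a suggestion is made.
        self.min_evidence = min_evidence
        self.threads: Dict[str, ThreadMemory] = {}

    def observe(self, frame: Frame) -> Optional[str]:
        """Update memory with one frame; return a recommendation or None."""
        if frame.task is None:
            return None  # noisy frame: stay silent
        memory = self.threads.setdefault(frame.task, ThreadMemory())
        memory.evidence.append(frame.description)
        if not memory.recommended and len(memory.evidence) >= self.min_evidence:
            memory.recommended = True  # recommend at most once per thread
            return f"Suggest help with: {frame.task}"
        return None


def run(frames: List[Frame]) -> List[Optional[str]]:
    """Feed frames to a fresh tracker in order and collect its outputs."""
    tracker = IntentTracker()
    return [tracker.observe(f) for f in frames]


# Two interleaved intents with one noisy frame in between.
trajectory = [
    Frame("flight search results", task="book flight"),
    Frame("social media feed"),
    Frame("restaurant listing", task="reserve dinner"),
    Frame("flight seat selection", task="book flight"),
    Frame("restaurant time picker", task="reserve dinner"),
]
outputs = run(trajectory)
```

In this toy trajectory the tracker stays silent on the first three frames and emits one suggestion per thread once each has two supporting frames. Keeping a separate memory slot per thread is what lets evidence for one intent survive an interruption by another intent or by noise.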
