Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessible screen reading tools and is in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the SPR task. Given an input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Using this tree, the ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, and it distinguishes our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Finally, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: screen-point-and-read.github.io
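
To make the idea of a layout tree queried by a point more concrete, here is a minimal sketch. The abstract does not say how the Hierarchical Layout Tree is built or represented, so everything below is an assumption made for illustration: the `LayoutNode` class, its fields, the `path_to_point` helper, and the hand-written example regions are all hypothetical and are not the ToL agent's implementation. The sketch only shows how nested screen regions can be walked from the whole screen down to the region under a given point.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class LayoutNode:
    """Hypothetical node: a named screen region plus its nested sub-regions."""
    name: str
    box: Box
    children: List["LayoutNode"] = field(default_factory=list)

    def contains(self, x: int, y: int) -> bool:
        left, top, right, bottom = self.box
        return left <= x < right and top <= y < bottom


def path_to_point(root: LayoutNode, x: int, y: int) -> List[LayoutNode]:
    """Return the regions from the root down to the deepest one containing (x, y)."""
    if not root.contains(x, y):
        return []
    path = [root]
    node = root
    while True:
        # Descend into the first child whose box contains the point, if any.
        hit = next((c for c in node.children if c.contains(x, y)), None)
        if hit is None:
            return path
        path.append(hit)
        node = hit


# Hand-written example layout (illustrative values, not real detector output).
screen = LayoutNode("screen", (0, 0, 1080, 1920), [
    LayoutNode("toolbar", (0, 0, 1080, 200), [
        LayoutNode("back_button", (20, 40, 140, 160)),
        LayoutNode("title", (160, 40, 900, 160)),
    ]),
    LayoutNode("content", (0, 200, 1080, 1920)),
])

names = [n.name for n in path_to_point(screen, 60, 100)]
print(" > ".join(names))  # screen > toolbar > back_button
```

Each region on the returned path supplies one level of context, which is the kind of layout information the abstract describes as necessary for interpreting the pointed element.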
