

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

June 27, 2024
Authors: Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
cs.AI

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the SPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, distinguishing our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Last but not least, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: screen-point-and-read.github.io
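The abstract's central idea is that, given a user-indicated point and a screenshot, the ToL agent builds a Hierarchical Layout Tree and reads off the chain of nested regions ("lenses") containing the point, which encodes layout and spatial context. The paper does not give implementation details here, so the following is only a minimal illustrative sketch of that point-to-lens-path lookup; the `Region` class, field names, and the toy screen layout are all hypothetical, not the authors' code.

```python
# Illustrative sketch (not the authors' implementation): given a point and a
# hierarchical layout tree of GUI regions, collect the root-to-leaf chain of
# nested regions ("lenses") that contain the point.
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    bbox: tuple            # (x1, y1, x2, y2) in screen pixels
    children: list = field(default_factory=list)

def contains(bbox, point):
    x1, y1, x2, y2 = bbox
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2

def lens_path(node, point):
    """Return the chain of regions, outermost first, containing the point."""
    if not contains(node.bbox, point):
        return []
    path = [node]
    for child in node.children:
        sub = lens_path(child, point)
        if sub:                       # descend into the child that contains the point
            path += sub
            break
    return path

# Toy layout tree: a screen with a sidebar and a content pane holding a button.
screen = Region("screen", (0, 0, 1080, 1920), [
    Region("sidebar", (0, 0, 200, 1920)),
    Region("content", (200, 0, 1080, 1920), [
        Region("submit_button", (400, 800, 700, 900)),
    ]),
])

print([r.name for r in lens_path(screen, (500, 850))])
# → ['screen', 'content', 'submit_button']
```

The resulting lens path is what would let a model describe not just the pointed element ("a submit button") but its placement ("inside the content pane, to the right of the sidebar"), which is the layout awareness the abstract emphasizes.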

