

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

June 27, 2024
作者: Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
cs.AI

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessibility screen-reading tools and is in great need of new models driven by advances in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, which uses a novel ToL grounding mechanism to address the SPR task. Given an input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on this tree, the agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting on-screen information, distinguishing our ToL agent from other screen-reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Finally, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along agent execution trajectories. Code and data: screen-point-and-read.github.io


November 29, 2024