Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessibility screen-reading tools and is in great need of new models driven by advances in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, which uses a novel ToL grounding mechanism, to address the SPR task. Given an input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on this tree, the ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, and it distinguishes our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Finally, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along agent execution trajectories. Code and data: screen-point-and-read.github.io
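
To make the idea of a layout tree queried by a point more concrete, the following is a minimal sketch, not the ToL agent's actual implementation. It assumes that screen regions are already available as nested bounding boxes (the abstract does not say how regions are detected), and every name in it (`Region`, `path_to_point`, the example regions and pixel sizes) is hypothetical and chosen only for illustration. Given a point, it returns the chain of regions from the whole screen down to the smallest region containing that point, which is the kind of context a layout-aware description could draw on.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    """One node of an illustrative layout tree (not the paper's data structure)."""
    name: str
    box: Box
    children: List["Region"] = field(default_factory=list)

    def contains(self, x: int, y: int) -> bool:
        # Half-open box: left/top edges are inside, right/bottom edges are not.
        left, top, right, bottom = self.box
        return left <= x < right and top <= y < bottom


def path_to_point(root: Region, x: int, y: int) -> List[Region]:
    """Return the regions from the root down to the deepest one containing (x, y).

    Returns an empty list if the point lies outside the root region.
    """
    if not root.contains(x, y):
        return []
    path = [root]
    node = root
    while True:
        # Descend into the first child that contains the point, if any.
        hit = next((c for c in node.children if c.contains(x, y)), None)
        if hit is None:
            return path
        path.append(hit)
        node = hit


# Assumed example layout: a 1080x1920 screen with a navigation bar and a list.
screen = Region("screen", (0, 0, 1080, 1920), [
    Region("navigation bar", (0, 0, 1080, 200), [
        Region("back button", (0, 0, 200, 200)),
        Region("title", (200, 0, 1080, 200)),
    ]),
    Region("content list", (0, 200, 1080, 1920)),
])

path = path_to_point(screen, 50, 100)
print(" > ".join(r.name for r in path))  # screen > navigation bar > back button
```

In this sketch the returned path carries both the content of the pointed element (the last region) and its layout context (the enclosing regions), which mirrors the distinction the abstract draws between reading content and describing layout.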

