
Harnessing Webpage UIs for Text-Rich Visual Understanding

October 17, 2024
Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
cs.AI

Abstract

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.
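The abstract describes a pipeline in which a text-only LLM reads a webpage's accessibility tree, synthesizes instruction–answer pairs without seeing any pixels, and the pairs are then matched with a screenshot of the same page for multimodal training. The sketch below illustrates that idea only; it is not the authors' released MultiUI pipeline. It assumes Playwright for the accessibility tree and screenshot, and `synthesize_with_text_llm` is a hypothetical stand-in for whatever text-based LLM client one would actually use.

```python
# Illustrative sketch of the data-synthesis idea from the abstract (not the
# official MultiUI code). Assumes: `pip install playwright` and
# `playwright install chromium`; `synthesize_with_text_llm` is hypothetical.
import json
from playwright.sync_api import sync_playwright


def synthesize_with_text_llm(prompt: str) -> list[dict]:
    """Hypothetical placeholder: send `prompt` to a text-only LLM and parse
    its response into a list of {"instruction": ..., "answer": ...} dicts."""
    raise NotImplementedError("plug in your preferred LLM client here")


def build_sample(url: str, screenshot_path: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Structured text view of the UI (roles, names, values) -- no pixels needed.
        ax_tree = page.accessibility.snapshot()
        # Screenshot of the same page, later paired with the generated instructions.
        page.screenshot(path=screenshot_path, full_page=True)
        browser.close()

    prompt = (
        "You are given the accessibility tree of a webpage. Write diverse "
        "instruction-answer pairs (QA, captioning, element grounding, action "
        "prediction) answerable from a screenshot of this page. Return JSON.\n\n"
        + json.dumps(ax_tree, indent=2)[:8000]  # truncate very long trees
    )
    instructions = synthesize_with_text_llm(prompt)
    # Each training sample pairs the visual input (screenshot) with supervision
    # that was synthesized from text alone.
    return {"image": screenshot_path, "conversations": instructions}
```

Repeating this over many pages and task prompts, as the paper does at the scale of one million websites, yields screenshot-plus-instruction samples suitable for multimodal instruction tuning.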
