

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

February 7, 2024
Authors: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
cs.AI

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
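To make the flexible patching idea from pix2struct concrete: rather than resizing every screenshot to a fixed square, the image is scaled to the largest grid of fixed-size patches that fits a patch budget while roughly preserving its aspect ratio, which suits the extreme aspect ratios of screens and infographics. The sketch below is a minimal, illustrative Python implementation of that scheme, assuming a patch side `patch_size` and a budget `max_patches` (the names and code are ours, not the paper's actual implementation):

```python
import math

import numpy as np
from PIL import Image


def flexible_patchify(image: Image.Image, patch_size: int = 16,
                      max_patches: int = 1024) -> np.ndarray:
    """Aspect-ratio-preserving patching in the spirit of pix2struct.

    Finds the largest (rows, cols) patch grid with rows * cols <= max_patches
    that approximately matches the image's aspect ratio, resizes the image to
    that grid, and cuts it into patches. Illustrative sketch only.
    """
    w, h = image.size
    # Largest uniform scale such that the resulting grid fits the budget:
    # (scale * h / P) * (scale * w / P) ~= max_patches.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    rows = max(min(math.floor(scale * h / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * w / patch_size), max_patches), 1)

    # Resize to an exact multiple of the patch size (PIL takes (width, height)).
    resized = image.resize((cols * patch_size, rows * patch_size))
    arr = np.asarray(resized)  # shape: (rows * P, cols * P, C)

    # Cut into a (rows * cols, P, P, C) stack of patches.
    patches = (arr.reshape(rows, patch_size, cols, patch_size, -1)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(rows * cols, patch_size, patch_size, -1))
    return patches
```

A tall phone screenshot and a wide desktop screenshot thus yield different grid shapes (e.g., 64x16 vs. 16x64 patches) under the same token budget, instead of both being distorted to a fixed square.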