
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

February 7, 2024
Authors: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
cs.AI

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
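The "flexible patching strategy of pix2struct" mentioned in the abstract refers to choosing a variable rows-by-columns patch grid that preserves each screenshot's aspect ratio within a fixed patch budget, rather than resizing every image to one square resolution. A minimal sketch of that grid computation is below; the `patch_size` and `max_patches` defaults are illustrative assumptions, not the settings used by ScreenAI or pix2struct.

```python
import math

def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Pick a rows x cols patch grid that preserves the image's aspect
    ratio while keeping rows * cols within a fixed patch budget (the
    core idea of pix2struct-style variable-resolution inputs)."""
    # Scale factor so the scaled image yields roughly max_patches patches.
    scale = math.sqrt(max_patches * (patch_size ** 2) / (height * width))
    rows = max(math.floor(scale * height / patch_size), 1)
    cols = max(math.floor(scale * width / patch_size), 1)
    # Flooring can still overshoot in edge cases; trim the longer side.
    while rows * cols > max_patches:
        if rows >= cols:
            rows -= 1
        else:
            cols -= 1
    return rows, cols

# A tall phone screenshot gets many more rows than columns,
# so fine vertical detail (lists, menus) is not squashed:
print(flexible_patch_grid(2400, 1080))
```

Under this scheme a wide desktop screenshot and a tall phone screenshot consume the same patch budget but with very different grid shapes, which is what lets one model handle both UIs and infographics at their native aspect ratios.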