The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

August 3, 2023
作者: Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao
cs.AI

Abstract

We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world and contains 132.2 billion tokens describing the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset will be released at https://github.com/OpenGVLab/All-Seeing, and a demo is available at https://huggingface.co/spaces/OpenGVLab/all-seeing.
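To make the region-recognition task concrete, here is a minimal sketch of what "assign an open-vocabulary tag to an image region" looks like. This is a hypothetical illustration only: it substitutes an off-the-shelf CLIP model for ASM (the abstract describes ASM as a unified model conditioned on language prompts and locations, not a crop-and-score pipeline), and the function name `recognize_regions`, the example boxes, and the candidate tags are invented for the example.

```python
# Hypothetical sketch: region-level recognition in the spirit of the AS project.
# CLIP here is a stand-in for ASM; the crop-and-score loop is only illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def recognize_regions(image: Image.Image, boxes, candidate_tags):
    """Score each region crop against open-vocabulary text tags and
    return the best-matching tag per region."""
    # boxes are (left, top, right, bottom) pixel coordinates
    crops = [image.crop(box) for box in boxes]
    prompts = [f"a photo of a {tag}" for tag in candidate_tags]
    inputs = processor(text=prompts, images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (num_regions, num_tags)
        logits = model(**inputs).logits_per_image
    best = logits.argmax(dim=-1)
    return [candidate_tags[i] for i in best.tolist()]

# Example usage (box and tags are made up):
# image = Image.open("street.jpg")
# print(recognize_regions(image, [(10, 20, 200, 180)],
#                         ["bicycle", "fire hydrant", "dog"]))
```

In ASM itself, per the abstract, location is supplied to the model as part of the prompt alongside open-ended text, so region understanding does not rely on literal cropping; the sketch above only conveys the input/output shape of the task.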