

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

August 3, 2023
Authors: Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao
cs.AI

Abstract

We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world and contains 132.2 billion tokens describing the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset will be released at https://github.com/OpenGVLab/All-Seeing, and a demo is available at https://huggingface.co/spaces/OpenGVLab/all-seeing.
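Since the abstract describes AS-1B as image regions annotated with semantic tags, question-answering pairs, and detailed captions, the sketch below illustrates what a single annotated region record might look like. It is a minimal, hypothetical example: the class and field names (RegionAnnotation, QAPair, bbox, semantic_tags, caption) are assumptions made for illustration, not the released schema; consult the repository at https://github.com/OpenGVLab/All-Seeing for the actual data format.

```python
# Hypothetical sketch of one AS-1B-style region annotation (field names are assumptions).
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class RegionAnnotation:
    image_id: str                                            # source image identifier
    bbox: Tuple[float, float, float, float]                  # region location (x1, y1, x2, y2)
    semantic_tags: List[str] = field(default_factory=list)   # open-world concept labels
    qa_pairs: List[QAPair] = field(default_factory=list)     # question-answering pairs
    caption: str = ""                                        # detailed free-form description


# One illustrative record out of the >1 billion regions described in the abstract.
example = RegionAnnotation(
    image_id="img_000001",
    bbox=(34.0, 52.5, 210.0, 318.0),
    semantic_tags=["golden retriever", "dog"],
    qa_pairs=[QAPair("What is the dog doing?", "It is lying on the grass.")],
    caption="A golden retriever lying on a sunny patch of grass in a park.",
)
```

A record of this shape would pair each region's location with the open-ended semantic tags, QA pairs, and caption text that the abstract says the data engine produces with human feedback and efficient models in the loop.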