InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieve state-of-the-art performance on, visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multi-modal dialogue systems. We hope that our research can contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
