VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Abstract

Large language models (LLMs) have notably accelerated progress toward artificial general intelligence (AGI). Their impressive zero-shot capacity for user-tailored tasks gives them immense potential across a range of applications. In computer vision, however, despite the availability of numerous powerful vision foundation models (VFMs), these models are still restricted to tasks in a pre-defined form and struggle to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. The framework provides a unified perspective on vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions for open-ended tasks based on these instructions. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo will be released based on https://github.com/OpenGVLab/InternGPT. The code will be released at https://github.com/OpenGVLab/VisionLLM.
