VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
May 18, 2023
作者: Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai
cs.AI
Abstract
Large language models (LLMs) have notably accelerated progress towards
artificial general intelligence (AGI), with their impressive zero-shot capacity
for user-tailored tasks, endowing them with immense potential across a range of
applications. However, in the field of computer vision, despite the
availability of numerous powerful vision foundation models (VFMs), they are
still restricted to tasks in a pre-defined form, struggling to match the
open-ended task capabilities of LLMs. In this work, we present an LLM-based
framework for vision-centric tasks, termed VisionLLM. This framework provides a
unified perspective for vision and language tasks by treating images as a
foreign language and aligning vision-centric tasks with language tasks that can
be flexibly defined and managed using language instructions. An LLM-based
decoder can then make appropriate predictions based on these instructions for
open-ended tasks. Extensive experiments show that the proposed VisionLLM can
achieve different levels of task customization through language instructions,
from fine-grained object-level to coarse-grained task-level customization, all
with good results. Notably, with a generalist LLM-based framework, our model
achieves over 60% mAP on COCO, on par with detection-specific
models. We hope this model can set a new baseline for generalist vision and
language models. A demo will be released via
https://github.com/OpenGVLab/InternGPT, and the code will be released at
https://github.com/OpenGVLab/VisionLLM.
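
To make the abstract's core idea concrete, below is a minimal, self-contained PyTorch sketch of the data flow it describes: an image is turned into a short sequence of embeddings (treated like words of a "foreign language"), concatenated with a tokenized language instruction, and passed to an LLM-style decoder that predicts the open-ended task output. Every name, dimension, and module below is an illustrative assumption, not the authors' implementation or the released VisionLLM code.

```python
# Illustrative sketch only: toy stand-ins for "image as a foreign language"
# plus an instruction-conditioned, LLM-style decoder. Not the paper's code.

import torch
import torch.nn as nn

class ToyImageTokenizer(nn.Module):
    """Turns an image into a short sequence of visual tokens ("foreign words")."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # 224x224 image -> 4x4 grid of patch embeddings -> 16 visual tokens
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=56, stride=56)

    def forward(self, image):                        # image: (B, 3, 224, 224)
        patches = self.proj(image)                   # (B, D, 4, 4)
        return patches.flatten(2).transpose(1, 2)    # (B, 16, D)

class ToyOpenEndedDecoder(nn.Module):
    """A tiny transformer standing in for the LLM-based decoder."""
    def __init__(self, vocab_size=1000, embed_dim=256, num_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, visual_tokens, instruction_ids):
        text_tokens = self.text_embed(instruction_ids)            # (B, T, D)
        tokens = torch.cat([visual_tokens, text_tokens], dim=1)   # image + instruction
        hidden = self.backbone(tokens)
        return self.lm_head(hidden)                               # next-token logits

if __name__ == "__main__":
    image = torch.randn(1, 3, 224, 224)
    instruction_ids = torch.randint(0, 1000, (1, 12))  # e.g. "detect every car ..."
    visual_tokens = ToyImageTokenizer()(image)
    logits = ToyOpenEndedDecoder()(visual_tokens, instruction_ids)
    print(logits.shape)  # (1, 16 + 12, 1000)
```

The sketch only illustrates the flow the abstract describes: visual tokens and a free-form instruction go in, and the LLM-based decoder produces an instruction-dependent, open-ended prediction; in the actual framework the decoder is a large language model and the instruction determines how the task and its output are defined.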