GiT: Towards Generalist Vision Transformer through Universal Language Interface
March 14, 2024
Authors: Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang
cs.AI
Abstract
This paper proposes a simple, yet effective framework, called GiT,
simultaneously applicable for various vision tasks only with a vanilla ViT.
Motivated by the universality of the Multi-layer Transformer architecture (e.g.,
GPT) widely used in large language models (LLMs), we seek to broaden its scope
to serve as a powerful vision foundation model (VFM). However, unlike language
modeling, visual tasks typically require specific modules, such as bounding box
heads for detection and pixel decoders for segmentation, greatly hindering the
application of powerful multi-layer transformers in the vision domain. To solve
this, we design a universal language interface that empowers successful
auto-regressive decoding to adeptly unify various visual tasks, from
image-level understanding (e.g., captioning), through sparse perception (e.g.,
detection), to dense prediction (e.g., segmentation). Based on the above
designs, the entire model is composed solely of a ViT, without any specific
additions, offering a remarkable architectural simplification. GiT is a
multi-task visual model, jointly trained across five representative benchmarks
without task-specific fine-tuning. Interestingly, our GiT builds a new
benchmark in generalist performance, and fosters mutual enhancement across
tasks, leading to significant improvements compared to isolated training. This
reflects a similar impact observed in LLMs. Further enriching training with 27
datasets, GiT achieves strong zero-shot results over various tasks. Due to its
simple design, this paradigm holds promise for narrowing the architectural gap
between vision and language. Code and models will be available at
https://github.com/Haiyang-W/GiT.
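To make the "universal language interface" idea concrete, below is a minimal, hypothetical sketch rather than the authors' released implementation: task targets such as bounding boxes are quantized into discrete tokens drawn from a shared vocabulary, and a single plain multi-layer Transformer processes image patch tokens and task tokens jointly, predicting the next token auto-regressively. All names and sizes here (NUM_BINS, TEXT_VOCAB, TinyGeneralistViT, layer widths, the masking scheme) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (assumed, not the authors' code) of a universal language
# interface: boxes become discrete location tokens in a shared vocabulary,
# and one vanilla Transformer decodes them auto-regressively.
import torch
import torch.nn as nn

NUM_BINS = 1000      # number of coordinate bins (assumed)
TEXT_VOCAB = 30522   # size of a word-piece text vocabulary (assumed)
VOCAB = TEXT_VOCAB + NUM_BINS   # shared vocabulary: text tokens + location tokens


def serialize_box(box_xyxy, img_w, img_h):
    """Quantize a pixel-space box into 4 discrete location tokens."""
    x1, y1, x2, y2 = box_xyxy
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [TEXT_VOCAB + min(int(c * NUM_BINS), NUM_BINS - 1) for c in coords]


class TinyGeneralistViT(nn.Module):
    """One plain multi-layer Transformer handles image patches and task tokens jointly
    (positional embeddings omitted for brevity)."""

    def __init__(self, dim=256, depth=4, heads=8, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.tok_emb = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth,
                                            enable_nested_tensor=False)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, image, token_ids):
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        txt = self.tok_emb(token_ids)                             # (B, T, dim) task tokens
        x = torch.cat([vis, txt], dim=1)
        N, T = vis.size(1), txt.size(1)
        # Assumed mixed attention mask: patch tokens attend only to patches
        # (bidirectional); task tokens attend to all patches and earlier task
        # tokens (causal), enabling next-token prediction.
        mask = torch.zeros(N + T, N + T)
        mask[:N, N:] = float("-inf")
        mask[N:, N:] = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x[:, N:])    # next-token logits at the task-token positions


if __name__ == "__main__":
    image = torch.randn(1, 3, 224, 224)
    box_tokens = serialize_box([48, 32, 320, 240], img_w=640, img_h=480)
    model = TinyGeneralistViT()
    logits = model(image, torch.tensor([box_tokens]))
    print(logits.shape)  # torch.Size([1, 4, 31522])
```

Under this framing, detection, segmentation, and captioning differ only in how their targets are serialized into the shared vocabulary, which is what lets the whole model remain a vanilla ViT without task-specific heads.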