GiT: Towards Generalist Vision Transformer through Universal Language Interface

Abstract

This paper proposes GiT, a simple yet effective framework that applies to a range of vision tasks simultaneously using only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). Unlike language modeling, however, visual tasks typically require task-specific modules, such as bounding box heads for detection and pixel decoders for segmentation, which greatly hinders the application of multi-layer transformers in the vision domain. To address this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, ranging from image-level understanding (e.g., captioning) through sparse perception (e.g., detection) to dense prediction (e.g., segmentation). With this design, the entire model consists solely of a ViT, without any task-specific additions, which simplifies the architecture considerably. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements over isolated training. This mirrors a similar effect observed in LLMs. When training is further enriched with 27 datasets, GiT achieves strong zero-shot results on various tasks. Owing to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at https://github.com/Haiyang-W/GiT.
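To make the idea of a language interface for sparse perception more concrete, the following is a minimal sketch of how a bounding box could be written as a short token sequence that an auto-regressive decoder might emit, and then read back. It is an illustration only, not GiT's actual tokenization: the coordinate-binning scheme, the bin count `NUM_BINS`, the `<k>` token format, and the names `Box`, `quantize`, `box_to_tokens`, and `tokens_to_box` are all assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical number of discrete coordinate bins (an assumption, not from the paper).
NUM_BINS = 1000


@dataclass(frozen=True)
class Box:
    """An axis-aligned box in pixel coordinates with a text category label."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * num_bins), num_bins - 1)


def dequantize(bin_index: int, size: float, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the pixel coordinate at the center of that bin."""
    return (bin_index + 0.5) / num_bins * size


def box_to_tokens(box: Box, width: float, height: float) -> list[str]:
    """Serialize a box as four coordinate tokens followed by its label token."""
    return [
        f"<{quantize(box.x1, width)}>",
        f"<{quantize(box.y1, height)}>",
        f"<{quantize(box.x2, width)}>",
        f"<{quantize(box.y2, height)}>",
        box.label,
    ]


def tokens_to_box(tokens: list[str], width: float, height: float) -> Box:
    """Parse the five-token sequence produced by box_to_tokens back into a Box."""
    bins = [int(token[1:-1]) for token in tokens[:4]]  # strip the '<' and '>'
    return Box(
        label=tokens[4],
        x1=dequantize(bins[0], width),
        y1=dequantize(bins[1], height),
        x2=dequantize(bins[2], width),
        y2=dequantize(bins[3], height),
    )
```

Under these assumptions, a detection target becomes the same kind of output as a caption: a sequence of tokens drawn from one vocabulary, so no separate box head is needed. The round trip is lossy by at most one bin width per coordinate, which is the price of discretizing the coordinates.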
