GiT: Towards Generalist Vision Transformer through Universal Language Interface
March 14, 2024
Authors: Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang
cs.AI
Abstract
This paper proposes a simple, yet effective framework, called GiT,
simultaneously applicable for various vision tasks only with a vanilla ViT.
Motivated by the universality of the Multi-layer Transformer architecture (e.g.,
GPT) widely used in large language models (LLMs), we seek to broaden its scope
to serve as a powerful vision foundation model (VFM). However, unlike language
modeling, visual tasks typically require specific modules, such as bounding box
heads for detection and pixel decoders for segmentation, greatly hindering the
application of powerful multi-layer transformers in the vision domain. To solve
this, we design a universal language interface that empowers the successful
auto-regressive decoding to adeptly unify various visual tasks, from
image-level understanding (e.g., captioning), through sparse perception (e.g.,
detection), to dense prediction (e.g., segmentation). Based on the above
designs, the entire model is composed solely of a ViT, without any specific
additions, offering a remarkable architectural simplification. GiT is a
multi-task visual model, jointly trained across five representative benchmarks
without task-specific fine-tuning. Interestingly, our GiT builds a new
benchmark in generalist performance, and fosters mutual enhancement across
tasks, leading to significant improvements compared to isolated training. This
reflects a similar impact observed in LLMs. Further enriching training with 27
datasets, GiT achieves strong zero-shot results over various tasks. Due to its
simple design, this paradigm holds promise for narrowing the architectural gap
between vision and language. Code and models will be available at
https://github.com/Haiyang-W/GiT.
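To make the idea of a universal language interface more concrete, below is a minimal PyTorch sketch of how a single plain multi-layer Transformer could consume image patch tokens plus a task prompt and auto-regressively emit outputs for captioning, detection, or segmentation as tokens from one shared vocabulary, with no task-specific heads. The class name UnifiedVisionDecoder, the token layout, and all hyperparameters are illustrative assumptions, not the authors' actual implementation; refer to the official repository for the real model.

```python
import torch
import torch.nn as nn


class UnifiedVisionDecoder(nn.Module):
    """A single plain multi-layer Transformer that takes image patch tokens plus a
    task prompt and auto-regressively predicts output tokens from one shared
    vocabulary (words, quantized box coordinates, class ids); no task-specific heads."""

    def __init__(self, vocab_size=32000, dim=768, depth=12, heads=12, max_len=2048):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)                  # shared text/coordinate/class vocabulary
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # ViT-style patchification
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)                 # the only compute backbone
        self.head = nn.Linear(dim, vocab_size)                            # one prediction head for every task

    def forward(self, image, tokens):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)      # (B, N, D)
        text = self.token_embed(tokens)                                   # (B, T, D)
        x = torch.cat([patches, text], dim=1)
        x = x + self.pos_embed[:, : x.size(1)]

        # Attention mask: image patches attend only among themselves (bidirectional),
        # text tokens attend to all patches and causally to earlier text tokens.
        N, T = patches.size(1), text.size(1)
        mask = torch.zeros(N + T, N + T, dtype=torch.bool, device=x.device)
        mask[:N, N:] = True                                               # patches do not peek at text
        mask[N:, N:] = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)

        x = self.blocks(x, mask=mask)
        return self.head(x[:, N:])                                        # next-token logits at text positions


# Usage sketch: every task is just a different target sequence over the same vocabulary,
# e.g. detection as [<det>, x1_bin, y1_bin, x2_bin, y2_bin, class_id, ...], captioning as
# ordinary word tokens, and segmentation as per-pixel class ids decoded in raster order.
model = UnifiedVisionDecoder()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))  # (1, 8, 32000)
```

The design choice this illustrates is the one the abstract emphasizes: because every task is reduced to next-token prediction over a unified vocabulary, the model needs no bounding-box heads or pixel decoders, and the same weights can be jointly trained across image-level, sparse, and dense tasks.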