GiT: Towards Generalist Vision Transformer through Universal Language Interface

Abstract

This paper proposes GiT, a simple yet effective framework that applies simultaneously to a variety of vision tasks using only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope so that it can serve as a powerful vision foundation model (VFM). Unlike language modeling, however, visual tasks typically require task-specific modules, such as bounding box heads for detection and pixel decoders for segmentation, which greatly hinders the application of multi-layer Transformers in the vision domain. To address this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, from image-level understanding (e.g., captioning), through sparse perception (e.g., detection), to dense prediction (e.g., segmentation). With this design, the entire model consists solely of a ViT, without any task-specific additions, which is a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements over isolated training. This mirrors a similar effect observed in LLMs. When training is further enriched with 27 datasets, GiT achieves strong zero-shot results across various tasks. Owing to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at https://github.com/Haiyang-W/GiT.
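To make the idea of a language interface for a sparse perception task more concrete, the following is a minimal sketch of how a detection target could be written as a short token sequence that an auto-regressive decoder might emit. It is not taken from the GiT code base. The bin count `NUM_BINS`, the `<bin_k>` token format, and the helper names `box_to_tokens` and `tokens_to_box` are all hypothetical choices made for illustration, and the actual vocabulary and serialization used by GiT may differ.

```python
from typing import List, Tuple

# Hypothetical number of discrete coordinate bins (an assumption, not from the paper).
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def box_to_tokens(box: Box, image_size: Tuple[int, int], label: str,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Serialize one box and its class label as a flat token sequence."""
    width, height = image_size
    extents = (width, height, width, height)

    def quantize(value: float, extent: int) -> int:
        # Normalize to [0, 1], then map to an integer bin index.
        ratio = min(max(value / extent, 0.0), 1.0)
        return min(int(ratio * num_bins), num_bins - 1)

    bins = [quantize(v, e) for v, e in zip(box, extents)]
    return [f"<bin_{b}>" for b in bins] + [label]


def tokens_to_box(tokens: List[str], image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Tuple[Box, str]:
    """Invert box_to_tokens, returning bin-center coordinates and the label."""
    width, height = image_size
    extents = (width, height, width, height)
    bins = [int(t[len("<bin_"):-1]) for t in tokens[:4]]
    coords = tuple((b + 0.5) / num_bins * e for b, e in zip(bins, extents))
    return coords, tokens[4]


if __name__ == "__main__":
    tokens = box_to_tokens((160, 120, 320, 240), (640, 480), "dog")
    print(tokens)
    print(tokens_to_box(tokens, (640, 480)))
```

Under these assumptions, detection becomes a sequence prediction problem with the same output form as captioning, which is the kind of unification the abstract describes. Quantization loses at most one bin width of precision per coordinate.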
