Adapting LLaMA Decoder to Vision Transformer

Abstract

This work examines whether decoder-only Transformers such as LLaMA, originally designed for large language models (LLMs), can be adapted to computer vision. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes attention collapse, so that network training fails. To overcome this, we propose a post-sequence class token technique that repositions the class token behind the image tokens, enabling causal self-attention to efficiently capture information from the entire image. Additionally, we develop a soft mask strategy that gradually introduces the causal mask into the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and supports direct supervised learning. Its causal self-attention improves computational efficiency and learns complex representations by raising the rank of the attention maps. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study offers fresh perspectives on visual model design in the wave of LLMs. Pre-trained models and code are publicly available.
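The two ideas in the abstract, the post-sequence class token and the soft mask, can be illustrated with a small sketch. This is not the authors' implementation. The function names (`causal_mask`, `visible_tokens`, `soft_mask`) and the `progress` parameter are hypothetical, and the linear blend between a full mask and a causal mask is an assumption: the abstract says only that the causal mask is introduced gradually at the onset of training.

```python
def causal_mask(n):
    """Lower-triangular 0/1 mask: token i may attend to tokens 0..i only."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]


def visible_tokens(mask, query_index):
    """Indices of the tokens that the query token is allowed to attend to."""
    return [j for j, m in enumerate(mask[query_index]) if m > 0.0]


def soft_mask(n, progress):
    """Assumed linear blend (hypothetical schedule, not from the source):
    progress = 0.0 gives an all-ones (bidirectional) mask,
    progress = 1.0 gives the fully causal mask."""
    causal = causal_mask(n)
    return [[(1.0 - progress) + progress * causal[i][j] for j in range(n)]
            for i in range(n)]


num_patches = 4          # illustrative number of image tokens
n = num_patches + 1      # image tokens plus one class token

mask = causal_mask(n)
# Class token placed first: under a causal mask it sees only itself.
cls_first_sees = visible_tokens(mask, 0)
# Class token placed after the image tokens: it sees the whole sequence.
cls_last_sees = visible_tokens(mask, n - 1)
print(cls_first_sees)
print(cls_last_sees)

# First row of the soft mask as the (assumed) schedule progresses.
for progress in (0.0, 0.5, 1.0):
    print(progress, soft_mask(n, progress)[0])
```

In this toy setting, a class token at the front of the sequence can attend only to itself under a causal mask, which is consistent with the training failure the abstract describes. A class token at the end can attend to every image token.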
