
Adapting LLaMA Decoder to Vision Transformer

April 10, 2024
Authors: Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo
cs.AI

Abstract

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, resulting in the failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces the causal mask into the self-attention at the onset of training to facilitate optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further improves the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual model design in the wave of LLMs. Pre-trained models and code are available here.
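The abstract describes two key adaptations: appending the class token after the image tokens so the causal mask still lets it see the whole image, and a soft mask that gradually blends bidirectional attention into causal attention early in training. The sketch below is a minimal, hypothetical PyTorch rendering of both ideas under stated assumptions; the class names, tensor shapes, and the linear soft-mask blend are assumptions for illustration, not the released iLLaMA code.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Causal self-attention with a post-sequence class token and a soft mask.

    A hypothetical sketch of the ideas in the abstract, not the released
    iLLaMA implementation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Learnable class token, appended AFTER the image tokens so the causal
        # mask still allows it to attend to every image token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x: torch.Tensor, soft: float = 1.0) -> torch.Tensor:
        # x: (B, N, C) image tokens; soft in [0, 1] ramps up at the start of
        # training (0 = bidirectional attention, 1 = fully causal).
        B, N, C = x.shape
        x = torch.cat([x, self.cls_token.expand(B, -1, -1)], dim=1)  # (B, N+1, C)
        L = N + 1

        qkv = self.qkv(x).reshape(B, L, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B, H, L, d)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5

        # Strictly upper-triangular -inf mask: token i attends to j <= i only.
        causal = torch.full((L, L), float("-inf"), device=x.device).triu(1)

        # Soft mask: linearly blend bidirectional and causal attention maps.
        attn_bi = logits.softmax(dim=-1)
        attn_causal = (logits + causal).softmax(dim=-1)
        attn = (1.0 - soft) * attn_bi + soft * attn_causal

        out = (attn @ v).transpose(1, 2).reshape(B, L, C)
        return self.proj(out)
```

In this sketch, the classification feature would be read from the last position of the output, e.g. `features = layer(patch_tokens)[:, -1]`, and `soft` would be scheduled from 0 to 1 over the first training epochs to realize the soft mask strategy.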
