Adapting LLaMA Decoder to Vision Transformer
April 10, 2024
Authors: Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo
cs.AI
Abstract
This work examines whether decoder-only Transformers such as LLaMA, which
were originally designed for large language models (LLMs), can be adapted to
the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to
align with LLaMA's architecture, and find that directly applying a causal mask
to the self-attention brings an attention collapse issue, resulting in the
failure of network training. We suggest repositioning the class token behind
the image tokens with a post-sequence class token technique to overcome
this challenge, enabling causal self-attention to efficiently capture the
entire image's information. Additionally, we develop a soft mask strategy that
gradually introduces a causal mask to the self-attention at the onset of
training to facilitate the optimization behavior. The tailored model, dubbed
image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct
supervised learning. Its causal self-attention boosts computational efficiency
and learns complex representations by elevating attention map ranks. iLLaMA
rivals the performance of its encoder-only counterparts, achieving 75.1%
ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M
parameters and pre-training on ImageNet-21K further enhances the accuracy to
86.0%. Extensive experiments demonstrate iLLaMA's reliable properties:
calibration, shape-texture bias, quantization compatibility, ADE20K
segmentation, and CIFAR transfer learning. We hope our study can kindle fresh
views on visual model design in the wave of LLMs. Pre-trained models and code
are available here.
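As a rough illustration of the two mechanisms the abstract highlights, the PyTorch sketch below (assuming PyTorch >= 2.0) appends the class token after the image tokens and builds a soft causal bias that is gradually annealed in during early training. The function names, the linear blending schedule, and the log-domain additive bias are illustrative assumptions, not the authors' released iLLaMA implementation.

```python
import torch
import torch.nn.functional as F

def soft_causal_bias(num_tokens: int, progress: float) -> torch.Tensor:
    """Additive attention bias for a visibility matrix linearly blended from
    all-ones (progress=0.0, fully bidirectional) to strictly causal (progress=1.0).
    This is one plausible reading of the paper's soft mask strategy."""
    causal = torch.tril(torch.ones(num_tokens, num_tokens))  # 1 = visible, 0 = hidden
    soft = (1.0 - progress) + progress * causal              # blend the two masks
    return soft.clamp(min=1e-6).log()                        # 0 bias where fully visible

def append_class_token(image_tokens: torch.Tensor, cls_token: torch.Tensor) -> torch.Tensor:
    """Post-sequence class token: concatenate [CLS] *after* the image tokens so
    that, under causal attention, it can still attend to every image token."""
    batch = image_tokens.shape[0]
    return torch.cat([image_tokens, cls_token.expand(batch, -1, -1)], dim=1)

# Toy usage: 196 patch tokens plus one trailing class token, halfway through the anneal.
B, N, D = 2, 196, 192
tokens = append_class_token(torch.randn(B, N, D), torch.zeros(1, 1, D))
bias = soft_causal_bias(tokens.shape[1], progress=0.5)
out = F.scaled_dot_product_attention(
    tokens.unsqueeze(1), tokens.unsqueeze(1), tokens.unsqueeze(1), attn_mask=bias)
cls_feature = out[:, 0, -1]  # the class token sits at the last position
```

In this reading, `progress` would be driven by the training schedule (for example, ramped from 0 to 1 over the first epochs); once it reaches 1, the bias behaves like a standard causal mask and the class token, sitting last in the sequence, still aggregates information from the whole image.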