Adapting LLaMA Decoder to Vision Transformer

Abstract

This work examines whether decoder-only Transformers such as LLaMA, originally designed for large language models (LLMs), can be adapted to computer vision. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse that makes network training fail. To overcome this, we propose a post-sequence class token technique that repositions the class token behind the image tokens, so that causal self-attention can efficiently capture the information of the entire image. Additionally, we develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating the ranks of the attention maps. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual model design in the wave of LLMs. Pre-trained models and code are available here.
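
The two techniques named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the linear blend between a bidirectional and a causal mask, and the choice to scale post-softmax weights and renormalise are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def soft_causal_mask(num_tokens: int, progress: float) -> np.ndarray:
    """Blend a bidirectional mask into a causal one (assumed linear blend).

    progress = 0.0 gives an all-ones mask; progress = 1.0 gives a
    lower-triangular (causal) mask.
    """
    progress = float(np.clip(progress, 0.0, 1.0))
    full = np.ones((num_tokens, num_tokens))
    causal = np.tril(full)
    return (1.0 - progress) * full + progress * causal

def masked_attention(q, k, v, mask):
    """Single-head attention; weights are scaled by `mask`, then renormalised.

    Applying the mask after the exponential is an assumption of this sketch.
    The diagonal of the mask is always 1, so each row sum stays positive.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights * mask
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Post-sequence class token: append it AFTER the image tokens so that,
# under a causal mask, its row can attend to every image token.
rng = np.random.default_rng(0)
num_patches, dim = 4, 8                       # hypothetical sizes
patches = rng.normal(size=(num_patches, dim))
cls_token = rng.normal(size=(1, dim))
tokens = np.concatenate([patches, cls_token], axis=0)

# Fully causal mask (end of the assumed warm-up).
mask = soft_causal_mask(len(tokens), progress=1.0)
out, weights = masked_attention(tokens, tokens, tokens, mask)
```

In this sketch the last row of `weights` (the class token) is non-zero for every token, whereas the first row attends only to itself. That asymmetry is why a class token placed at the start of the sequence would see no image information under a causal mask. During training, `progress` would be increased from 0 to 1 over some warm-up period; the schedule itself is left open here.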
