Adapting LLaMA Decoder to Vision Transformer

Abstract

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, which makes network training fail. To overcome this challenge, we propose a post-sequence class token technique that repositions the class token behind the image tokens, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further improves the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual model design in the wave of LLMs. Pre-trained models and code are available here.
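
The two techniques named in the abstract can be illustrated with a small sketch. It is not taken from the iLLaMA code release: the function names, the linear blending schedule, and the `warmup_steps` parameter are assumptions made for illustration. The abstract states only that the causal mask is introduced gradually at the onset of training, and it does not specify how the mask enters the attention computation. The sketch shows why a class token at the end of the sequence can see the whole image under a causal mask, and one plausible way to move from a bidirectional mask to a causal one.

```python
# Illustrative sketch only: names and the linear schedule are assumptions,
# not the authors' implementation.

def causal_mask(n):
    """Lower-triangular mask: token i may attend to tokens 0..i."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]


def soft_mask(n, step, warmup_steps):
    """Blend from a bidirectional (all-ones) mask to a causal mask.

    Assumption: a linear schedule over `warmup_steps` training steps.
    """
    if warmup_steps <= 0:
        alpha = 1.0  # no warmup: use the causal mask immediately
    else:
        alpha = min(max(step / warmup_steps, 0.0), 1.0)
    causal = causal_mask(n)
    # alpha = 0 gives all ones (bidirectional); alpha = 1 gives the causal mask.
    return [
        [(1.0 - alpha) + alpha * causal[i][j] for j in range(n)]
        for i in range(n)
    ]


def visible_tokens(mask, index):
    """Count how many tokens the token at `index` can attend to."""
    return sum(1 for weight in mask[index] if weight > 0.0)


num_patches = 4
seq_len = num_patches + 1  # image patches plus one class token
mask = causal_mask(seq_len)

# Class token placed first (standard ViT order): it can attend only to itself.
pre_sequence_visible = visible_tokens(mask, 0)

# Class token placed last (post-sequence): it attends to every patch and itself.
post_sequence_visible = visible_tokens(mask, seq_len - 1)
```

Under these assumptions, `soft_mask(n, 0, warmup_steps)` is fully bidirectional and `soft_mask(n, warmup_steps, warmup_steps)` equals `causal_mask(n)`, so the model starts with ViT-style attention and ends with LLaMA-style causal attention.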
