Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

Abstract

Neural-network-based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network (CNN). However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating the strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future work should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones
