Does Visual Pretraining Help End-to-End Reasoning?

Abstract

We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization in visual reasoning, and would confirm the feasibility of a neural network "generalist" that solves both visual recognition and reasoning tasks. We propose a simple and general self-supervised framework that "compresses" each video frame into a small set of tokens with a transformer network and reconstructs the remaining frames from the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation of each image and also capture temporal dynamics and object permanence from the temporal context. We evaluate on two visual reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential for achieving compositional generalization in end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.
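The compress-then-reconstruct objective described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the attention here has no learned projections, and the names `slot_queries`, `pos_queries`, and `num_context`, along with all shapes, are assumptions introduced only for illustration. It shows each frame being pooled into a few tokens, and the remaining frames being predicted from the tokens of the context frames alone.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head dot-product attention without learned projections (a simplification)."""
    d = queries.shape[-1]
    scores = queries @ np.swapaxes(keys_values, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def compress_frames(patches, slot_queries):
    """Pool each frame's patch embeddings (T, P, D) into K tokens per frame: (T, K, D)."""
    return attend(slot_queries[None], patches)

def reconstruction_loss(patches, slot_queries, pos_queries, num_context):
    """Mean squared error for predicting the remaining frames from compressed context.

    patches:      (T, P, D) patch embeddings for T frames
    slot_queries: (K, D) hypothetical query vectors, one per compressed token
    pos_queries:  (T, P, D) hypothetical positional queries, one per patch
    num_context:  number of leading frames used as temporal context
    """
    d = patches.shape[-1]
    tokens = compress_frames(patches, slot_queries)       # (T, K, D)
    context = tokens[:num_context].reshape(1, -1, d)      # (1, num_context * K, D)
    targets = patches[num_context:]                       # (T - num_context, P, D)
    # The decoder sees only the compressed context, never the target patches.
    predictions = attend(pos_queries[num_context:], context)
    return float(np.mean((predictions - targets) ** 2))

# Toy example with random embeddings (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, P, D, K = 6, 16, 8, 2
patches = rng.normal(size=(T, P, D))
slot_queries = rng.normal(size=(K, D))
pos_queries = rng.normal(size=(T, P, D))
loss = reconstruction_loss(patches, slot_queries, pos_queries, num_context=4)
print(loss)
```

Because the decoder only has access to K tokens per context frame, lowering this loss during training would require those tokens to carry the scene content and dynamics needed to predict later frames, which is the intuition the abstract gives for why the representation becomes compact and temporally aware.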
