Does Visual Pretraining Help End-to-End Reasoning?
July 17, 2023
Authors: Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
cs.AI
Abstract
We aim to investigate whether end-to-end learning of visual reasoning can be
achieved with general-purpose neural networks, with the help of visual
pretraining. A positive result would refute the common belief that explicit
visual abstraction (e.g. object detection) is essential for compositional
generalization on visual reasoning, and confirm the feasibility of a neural
network "generalist" to solve visual recognition and reasoning tasks. We
propose a simple and general self-supervised framework which "compresses" each
video frame into a small set of tokens with a transformer network, and
reconstructs the remaining frames based on the compressed temporal context. To
minimize the reconstruction loss, the network must learn a compact
representation for each image, as well as capture temporal dynamics and object
permanence from temporal context. We perform evaluation on two visual reasoning
benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve
compositional generalization for end-to-end visual reasoning. Our proposed
framework outperforms traditional supervised pretraining, including image
classification and explicit object detection, by large margins.
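
The compress-then-reconstruct objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' architecture: it replaces the transformer with a single projection-free attention step, and every name in it (`compress_frame`, `reconstruction_loss`, `slot_queries`, `patch_queries`, `time_embed`) is a hypothetical choice made for illustration. It only shows the shape of the idea: each context frame is reduced to a few tokens, and the held-out frames are predicted from those tokens alone.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Single-head dot-product attention with no learned projections.
    # Assumption: a stand-in for the transformer layers in the abstract.
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def compress_frame(patch_tokens, slot_queries):
    # "Compress" one frame: (N, D) patch tokens -> (K, D) tokens, K << N.
    return cross_attend(slot_queries, patch_tokens)

def reconstruction_loss(video, slot_queries, patch_queries, time_embed, target_idx):
    # video: (T, N, D) array of patch tokens for T frames.
    # Frames listed in target_idx are held out and must be reconstructed
    # from the compressed tokens of the remaining (context) frames.
    num_frames = video.shape[0]
    context = np.concatenate(
        [compress_frame(video[t], slot_queries) + time_embed[t]
         for t in range(num_frames) if t not in target_idx],
        axis=0,
    )
    loss = 0.0
    for t in target_idx:
        # Queries carry the target frame's time index (hypothetical design).
        pred = cross_attend(patch_queries + time_embed[t], context)
        loss += np.mean((pred - video[t]) ** 2)
    return loss / len(target_idx), context
```

In a trained model the queries and embeddings would be learned parameters and the loss would be minimized by gradient descent. Here they are plain arrays, so the function only evaluates the objective for given inputs.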