Does Visual Pretraining Help End-to-End Reasoning?
July 17, 2023
Authors: Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
cs.AI
Abstract
We aim to investigate whether end-to-end learning of visual reasoning can be
achieved with general-purpose neural networks, with the help of visual
pretraining. A positive result would refute the common belief that explicit
visual abstraction (e.g. object detection) is essential for compositional
generalization on visual reasoning, and confirm the feasibility of a neural
network "generalist" to solve visual recognition and reasoning tasks. We
propose a simple and general self-supervised framework which "compresses" each
video frame into a small set of tokens with a transformer network, and
reconstructs the remaining frames based on the compressed temporal context. To
minimize the reconstruction loss, the network must learn a compact
representation for each image, as well as capture temporal dynamics and object
permanence from temporal context. We evaluate our approach on two visual
reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve
compositional generalization for end-to-end visual reasoning. Our proposed
framework outperforms traditional supervised pretraining, including image
classification and explicit object detection, by large margins.
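The pretraining objective described above — compressing each frame into a few tokens and reconstructing held-out frames from the compressed temporal context — can be illustrated with a toy sketch. The following is a minimal, assumption-laden stand-in: simple linear maps (`W_compress`, `W_decode`, both illustrative names) replace the paper's transformer encoder and decoder, and frames are toy feature vectors rather than images.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, K = 8, 64, 4   # frames per clip, frame feature dim, tokens per frame
TOKEN_DIM = 16       # dimensionality of each compressed token

# Illustrative parameters standing in for the paper's transformer networks.
W_compress = rng.normal(scale=0.1, size=(D, K * TOKEN_DIM))            # frame -> K tokens
W_decode = rng.normal(scale=0.1, size=((T - 1) * K * TOKEN_DIM, D))    # context -> frame

def compress(frame):
    # "Compress" one frame into K small tokens (stand-in for transformer pooling).
    return (frame @ W_compress).reshape(K, TOKEN_DIM)

def reconstruct(context_tokens):
    # Predict a masked frame from the concatenated tokens of all other frames.
    return context_tokens.reshape(-1) @ W_decode

video = rng.normal(size=(T, D))   # toy video: one feature vector per frame
masked_t = 3                      # frame held out for reconstruction

tokens = np.stack([compress(video[t]) for t in range(T) if t != masked_t])
pred = reconstruct(tokens)
loss = float(np.mean((pred - video[masked_t]) ** 2))  # reconstruction loss to minimize
```

Minimizing this reconstruction loss over many clips is what forces the compressed tokens to retain per-frame content as well as the temporal cues (dynamics, object permanence) needed to predict the missing frame; the actual method trains transformer weights rather than the fixed linear maps used here.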