Measuring Progress in Fine-grained Vision-and-Language Understanding
May 12, 2023
Authors: Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh
cs.AI
Abstract
While pretraining on large-scale image-text data from the Web has facilitated
rapid progress on many vision-and-language (V&L) tasks, recent work has
demonstrated that pretrained models lack "fine-grained" understanding, such as
the ability to recognise relationships, verbs, and numbers in images. This has
sparked increased community interest in developing new benchmarks and models
for such capabilities. To better understand and quantify
progress in this direction, we investigate four competitive V&L models on four
fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al.,
2022) consistently outperforms other baselines, and that modelling innovations
can impact performance more than scaling Web data, which can even degrade
performance. Through a deeper investigation of X-VLM, we highlight
the importance of both novel losses and rich data sources for learning
fine-grained skills. Finally, we inspect training dynamics, and discover that
for some tasks, performance peaks early in training or fluctuates
significantly, never converging.
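
To make the evaluation setup concrete, below is a minimal sketch of the
pairwise image-text matching protocol used by fine-grained benchmarks such as
Winoground (one of the four benchmarks studied): the model scores both
pairings of two images and two captions that differ only in a fine-grained
detail, such as the relation between subject and object. The sketch uses CLIP
from the transformers library as a stand-in scorer, since X-VLM is not
packaged there; the text_score helper and the placeholder images are
illustrative assumptions, not part of the paper's evaluation code.

    # Winoground-style pairwise image-text matching,
    # using CLIP as a stand-in scorer (X-VLM is not packaged here).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def text_score(images, captions):
        """Return True iff each image scores its own caption highest.
        Assumes images[i] is the ground-truth match for captions[i]."""
        inputs = processor(text=captions, images=images,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            # logits_per_image has shape (2 images, 2 captions)
            logits = model(**inputs).logits_per_image
        return bool(logits[0, 0] > logits[0, 1]) and bool(logits[1, 1] > logits[1, 0])

    # Placeholder images; a real run would load the benchmark's image pairs.
    images = [Image.new("RGB", (224, 224), c) for c in ("white", "black")]
    captions = ["a dog chasing a cat",
                "a cat chasing a dog"]  # differ only in the relation
    print("text score:", text_score(images, captions))

A benchmark accuracy is then the fraction of examples where this check
succeeds; Winoground additionally requires the symmetric image-side check,
where each caption must prefer its matching image.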