Measuring Progress in Fine-grained Vision-and-Language Understanding
May 12, 2023
Authors: Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh
cs.AI
Abstract
While pretraining on large-scale image-text data from the Web has facilitated
rapid progress on many vision-and-language (V&L) tasks, recent work has
demonstrated that pretrained models lack "fine-grained" understanding, such as
the ability to recognise relationships, verbs, and numbers in images. This has
increased the community's interest in developing new benchmarks or models for
such capabilities. To better understand and quantify
progress in this direction, we investigate four competitive V&L models on four
fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al.,
2022) consistently outperforms other baselines, and that modelling innovations
can impact performance more than scaling Web data, which can sometimes even
degrade performance. Through a deeper investigation of X-VLM, we highlight
the importance of both novel losses and rich data sources for learning
fine-grained skills. Finally, we inspect training dynamics, and discover that
for some tasks, performance peaks early in training or significantly
fluctuates, never converging.
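
The abstract does not spell out the evaluation protocol, but fine-grained benchmarks in this area (e.g. VALSE) commonly use caption-versus-foil scoring: each image is paired with a correct caption and a "foil" that differs in one fine-grained detail (a verb, relation, or number), and a model passes if it scores the caption above the foil. The sketch below illustrates this protocol under that assumption; `FoilExample` and `score_pair` are hypothetical names for illustration, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FoilExample:
    image_id: str
    caption: str  # text that matches the image
    foil: str     # text that differs in one fine-grained detail


def pairwise_accuracy(
    examples: list[FoilExample],
    score_pair: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the caption outscores its foil.

    `score_pair(image_id, text)` stands in for a pretrained V&L model's
    image-text matching score (e.g. from a model such as X-VLM).
    """
    correct = sum(
        score_pair(ex.image_id, ex.caption) > score_pair(ex.image_id, ex.foil)
        for ex in examples
    )
    return correct / len(examples)


if __name__ == "__main__":
    # Toy scorer for demonstration only: pretends the model can count.
    def toy_score(image_id: str, text: str) -> float:
        return float("two" in text)

    data = [FoilExample("img0", "two dogs run", "three dogs run")]
    print(f"pairwise accuracy: {pairwise_accuracy(data, toy_score):.2f}")
```

Rerunning such a metric over checkpoints saved throughout pretraining is one way to observe the training dynamics the abstract describes, where performance on some tasks peaks early or fluctuates without converging.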