Measuring Progress in Fine-grained Vision-and-Language Understanding

Abstract

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has led to increased interest in the community in developing either new benchmarks or new models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which sometimes even degrades performance. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics and discover that, for some tasks, performance either peaks early in training or fluctuates significantly and never converges.
