Measuring Progress in Fine-grained Vision-and-Language Understanding

Abstract

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has led to increased interest in the community in developing either new benchmarks or new models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which sometimes even degrades performance. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics and discover that, for some tasks, performance peaks early in training or fluctuates significantly, never converging.
