RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Abstract

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
