RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Abstract

As vision-language models (VLMs) become increasingly integrated into daily life, accurate visual culture understanding is becoming critical. Yet these models frequently fall short in interpreting cultural nuances. Prior work has demonstrated that retrieval-augmented generation (RAG) enhances cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.