RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
May 20, 2025
Authors: Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie
cs.AI
Abstract
As vision-language models (VLMs) become increasingly integrated into daily
life, the need for accurate visual culture understanding is becoming critical.
Yet, these models frequently fall short in interpreting cultural nuances
effectively. Prior work has demonstrated the effectiveness of
retrieval-augmented generation (RAG) in enhancing cultural understanding in
text-only settings, while its application in multimodal scenarios remains
underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented
Visual culturE uNdErstAnding), a new benchmark designed to advance visual
culture understanding through retrieval, focusing on two tasks: culture-focused
visual question answering (cVQA) and culture-informed image captioning (cIC).
RAVENEA extends existing datasets by integrating over 10,000 Wikipedia
documents curated and ranked by human annotators. With RAVENEA, we train and
evaluate seven multimodal retrievers that rank these documents for each image
query, and measure the
downstream impact of retrieval-augmented inputs across fourteen
state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented
with culture-aware retrieval, outperform their non-augmented counterparts (by
at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the
value of retrieval-augmented methods and culturally inclusive benchmarks for
multimodal understanding.
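
Below is a minimal sketch of the retrieve-then-read pipeline the abstract describes: a multimodal retriever scores candidate Wikipedia passages against an image query, and the top-ranked passages are prepended to the VLM prompt. It uses off-the-shelf CLIP from Hugging Face transformers as the retriever; the candidate passages, the prompt format, and the helper names (retrieve, build_cvqa_prompt) are illustrative assumptions, not RAVENEA's released data, trained retrievers, or evaluation code.

```python
# Sketch only: CLIP as a stand-in multimodal retriever for culture-focused VQA.
# Assumes torch, transformers, and Pillow are installed.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(image: Image.Image, docs: list[str], k: int = 3) -> list[str]:
    """Rank candidate documents against an image query by CLIP cosine similarity.

    Note: CLIP's text encoder truncates long passages (77 tokens), so a real
    system would chunk or summarize Wikipedia documents first.
    """
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=docs, return_tensors="pt",
                           padding=True, truncation=True)
    with torch.no_grad():
        img_emb = F.normalize(model.get_image_features(**img_inputs), dim=-1)
        txt_emb = F.normalize(model.get_text_features(**txt_inputs), dim=-1)
    scores = (img_emb @ txt_emb.T).squeeze(0)   # one similarity score per doc
    top = scores.topk(min(k, len(docs))).indices.tolist()
    return [docs[i] for i in top]

def build_cvqa_prompt(question: str, retrieved: list[str]) -> str:
    """Prepend retrieved cultural context to the question (assumed format)."""
    context = "\n".join(f"- {d}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Usage: feed the augmented prompt plus the image to any off-the-shelf VLM.
# docs = ["Songkran is the Thai New Year festival...", ...]  # placeholder corpus
# prompt = build_cvqa_prompt("What festival is shown?", retrieve(image, docs))
```

The design choice mirrors the benchmark's setup: retrieval quality and downstream VLM quality are decoupled, so any retriever can be swapped in and its effect measured on the same cVQA and cIC tasks.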