World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

November 27, 2025
Authors: Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
cs.AI

Abstract

In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models rely strongly on background cues: accuracy drops by 14% when cultural backgrounds are added to food-only baselines, and models produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning on a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
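The abstract does not give the exact metric definitions, so the following is only an illustrative sketch of how a per-subtask accuracy drop and a cross-context consistency score could be computed. The record layout, the function names, and the toy predictions are all assumptions made for this example and are not taken from CultureMix.

```python
from collections import defaultdict

# Hypothetical toy records: (food_id, subtask, predicted_culture, true_culture).
# These values are made up for illustration; they are not benchmark data.
records = [
    ("kimchi", "food-only", "Korea", "Korea"),
    ("kimchi", "food+background", "Japan", "Korea"),
    ("taco", "food-only", "Mexico", "Mexico"),
    ("taco", "food+background", "Mexico", "Mexico"),
]


def accuracy_by_subtask(records):
    """Return the fraction of correct predictions for each subtask."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, subtask, pred, gold in records:
        totals[subtask] += 1
        hits[subtask] += int(pred == gold)
    return {subtask: hits[subtask] / totals[subtask] for subtask in totals}


def consistency(records):
    """Return the fraction of foods given the same prediction in every context.

    This is one assumed notion of consistency, not the paper's definition.
    """
    preds = defaultdict(set)
    for food, _, pred, _ in records:
        preds[food].add(pred)
    return sum(len(p) == 1 for p in preds.values()) / len(preds)


acc = accuracy_by_subtask(records)
# Accuracy lost when a cultural background is added to the food-only setting.
drop = acc["food-only"] - acc["food+background"]
print(acc, drop, consistency(records))
```

In this toy setup, a model that changes its answer for the same food once a background is added lowers both the food+background accuracy and the consistency score, which is the kind of background reliance the abstract describes.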