World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

November 27, 2025
Authors: Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
cs.AI

Abstract
In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models rely strongly on the background, with accuracy dropping by 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning on a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
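The two behaviors the abstract reports, an accuracy drop when a background is added and inconsistent predictions for the same food across contexts, can be illustrated with a minimal Python sketch. The record layout, the function names, and the metric definitions below are assumptions made for illustration; the abstract does not specify how CultureMix computes these quantities, and the data is made-up toy data rather than results from the paper.

```python
from collections import defaultdict

# Hypothetical records: (food_id, subtask, predicted_origin, true_origin).
# Toy values for illustration only; not taken from the paper.
records = [
    ("kimchi", "food_only", "Korea", "Korea"),
    ("kimchi", "food+background", "Korea", "Korea"),
    ("taco", "food_only", "Mexico", "Mexico"),
    ("taco", "food+background", "Japan", "Mexico"),
]


def accuracy_by_subtask(rows):
    """Return the fraction of correct predictions for each subtask."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, subtask, pred, gold in rows:
        totals[subtask] += 1
        hits[subtask] += int(pred == gold)
    return {subtask: hits[subtask] / totals[subtask] for subtask in totals}


def consistency(rows):
    """Fraction of foods that receive the same prediction in every context.

    This is one assumed definition of consistency, not necessarily the paper's.
    """
    preds = defaultdict(set)
    for food, _, pred, _ in rows:
        preds[food].add(pred)
    return sum(len(p) == 1 for p in preds.values()) / len(preds)


acc = accuracy_by_subtask(records)
# Assumed definition: drop measured as an absolute difference in accuracy.
background_drop = acc["food_only"] - acc["food+background"]
consistency_score = consistency(records)
```

In this toy data the model labels the taco correctly on its own but switches its answer once a background is added, which lowers both the food+background accuracy and the consistency score.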
December 2, 2025