Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
November 21, 2025
Authors: Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua
cs.AI
Abstract
Multilingual text-to-image (T2I) models have advanced rapidly in visual realism and semantic alignment, and they are now widely used. Yet their outputs vary across cultural contexts. Because language carries cultural connotations, images generated from prompts in different languages should remain consistent with the culture each language carries. A systematic analysis shows that current T2I models often produce culturally neutral or English-biased results when given multilingual prompts. Analyses of two representative models indicate that the cause is not missing cultural knowledge. Instead, culture-related representations are insufficiently activated. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation, which amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement, which updates only the culturally relevant layers. Experiments on CultureBench, a benchmark we construct, show consistent gains in cultural consistency over strong baselines while preserving image fidelity and diversity.
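The abstract does not specify the probing statistic, the selection rule, or the amplification form. The sketch below therefore only illustrates the general idea of probing for culture-sensitive neurons, amplifying them at inference time, and picking layers to update. It is not the authors' method. It rests on these assumptions:

- Per-neuron activations for one layer are recorded as plain lists, one row per prompt.
- A neuron's sensitivity is its standardized mean difference between culture-specific and neutral prompts.
- The top-k neurons are kept.
- "Amplification" means multiplying their activations by a factor `alpha`.
- Layers are ranked by their strongest neuron score.

All function names (`sensitivity_scores`, `select_neurons`, `amplify`, `select_layers`) are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence

# Rows = prompts, columns = neurons of one layer (assumed recording format).
Matrix = Sequence[Sequence[float]]


def sensitivity_scores(culture_acts: Matrix, neutral_acts: Matrix, eps: float = 1e-6) -> List[float]:
    """Hypothetical probe: standardized mean difference per neuron between
    activations on culture-specific prompts and on neutral prompts."""
    n_neurons = len(culture_acts[0])
    scores = []
    for j in range(n_neurons):
        c = [row[j] for row in culture_acts]
        n = [row[j] for row in neutral_acts]
        pooled = statistics.pstdev(c + n)  # spread over both prompt sets
        scores.append((statistics.fmean(c) - statistics.fmean(n)) / (pooled + eps))
    return scores


def select_neurons(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices (ascending) of the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def amplify(hidden: Sequence[float], neurons: Sequence[int], alpha: float) -> List[float]:
    """Inference-time activation: scale the selected neurons by alpha and
    leave every other activation unchanged (no weights are modified)."""
    chosen = set(neurons)
    return [h * alpha if j in chosen else h for j, h in enumerate(hidden)]


def select_layers(scores_by_layer: Dict[str, Sequence[float]], m: int) -> List[str]:
    """Hypothetical layer choice for targeted fine-tuning: rank layers by their
    strongest neuron score and return the m best layer names."""
    return sorted(scores_by_layer, key=lambda name: max(scores_by_layer[name]), reverse=True)[:m]


if __name__ == "__main__":
    # Toy activations: neuron 2 fires much more strongly on culture-specific prompts.
    culture = [[0.1, 0.2, 1.5, 0.0], [0.2, 0.1, 1.7, 0.1], [0.0, 0.3, 1.6, 0.0]]
    neutral = [[0.1, 0.2, 0.2, 0.0], [0.2, 0.2, 0.1, 0.1], [0.1, 0.2, 0.3, 0.0]]
    top = select_neurons(sensitivity_scores(culture, neutral), k=1)
    print(top)  # [2]
    print(amplify([0.5, 0.5, 0.5, 0.5], top, alpha=2.0))  # [0.5, 0.5, 1.0, 0.5]
```

In an actual T2I backbone, the scaling would have to run inside the chosen layers' forward pass. In PyTorch, for instance, a module forward hook (`register_forward_hook`) can do this. The layer-targeted variant would then unfreeze only the layers returned by a rule such as `select_layers`. The plain-list version here only shows the selection and scaling arithmetic.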