
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

April 13, 2026
Authors: Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia, Aye Hninn Khine, Frederikus Hudi, David Anugraha, Alham Fikri Aji, Romrawin Chumpu, Viet-Thanh Pham, Minghan Wang, Mohamed Fazli Imam, Ruochen Zhang, Joseph Marvin Imperial, Do Xuan Long, Musa Izzanardi Wijanarko, Joel Ruben Antony Moniz, Patrick Amadeus Irawan, Hanif Muhammad Zhafran, Isaiah Flores, Ira Salsabila, Jun Kevin, Jostin Jerico Rosal, Patricia Nicole Monderin, Kun Kerdthaisong, Ahmad Mustafid, My Chiffon Nguyen, Natchapon Jongwiriyanurak, Siva Worajitwannakul, Haochen Li, Adrian Xuan Wei Lim, Bin Wang, Muhammad Ravi Shulthan Habibi, Lynnette Hui Xian Ng, Mithil Bangera, Yeshil Bangera, Priyaranjan Pattnayak, Dun Li Chan, Sherissa Caren Djuniwar, Hee Ming Shan
cs.AI

Abstract

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on three VL architectures (large vision-language models, text-to-image diffusion models, and vision-language embedding models) and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm toward the applicability of multimodal vision-language models in diverse regions, and they demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.
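The abstract does not spell out how GG-EZ performs model merging. As a rough illustrative sketch only (the function name, the `alpha` parameter, and the use of plain linear interpolation are assumptions, not details from the paper), model merging is commonly implemented as a weighted average of corresponding parameters between a regionally fine-tuned model and the original global model:

```python
def merge_state_dicts(base, regional, alpha=0.5):
    """Linearly interpolate two parameter dictionaries.

    alpha=0 keeps the base (global) weights; alpha=1 keeps the
    regionally fine-tuned weights. This is a generic illustration of
    model merging, not the paper's exact GG-EZ procedure.
    """
    merged = {}
    for name, base_w in base.items():
        reg_w = regional[name]  # assumes identical architectures / key sets
        merged[name] = [(1 - alpha) * b + alpha * r
                        for b, r in zip(base_w, reg_w)]
    return merged

# Toy example with two tiny "layers" represented as flat lists:
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
regional = {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]}
print(merge_state_dicts(base, regional, alpha=0.5))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [1.0]}
```

In practice the same interpolation would be applied tensor-by-tensor over a real checkpoint (e.g. a PyTorch `state_dict`), with `alpha` tuned to trade off regional relevance against global performance.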
April 17, 2026