
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

April 13, 2026
Authors: Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia, Aye Hninn Khine, Frederikus Hudi, David Anugraha, Alham Fikri Aji, Romrawin Chumpu, Viet-Thanh Pham, Minghan Wang, Mohamed Fazli Imam, Ruochen Zhang, Joseph Marvin Imperial, Do Xuan Long, Musa Izzanardi Wijanarko, Joel Ruben Antony Moniz, Patrick Amadeus Irawan, Hanif Muhammad Zhafran, Isaiah Flores, Ira Salsabila, Jun Kevin, Jostin Jerico Rosal, Patricia Nicole Monderin, Kun Kerdthaisong, Ahmad Mustafid, My Chiffon Nguyen, Natchapon Jongwiriyanurak, Siva Worajitwannakul, Haochen Li, Adrian Xuan Wei Lim, Bin Wang, Muhammad Ravi Shulthan Habibi, Lynnette Hui Xian Ng, Mithil Bangera, Yeshil Bangera, Priyaranjan Pattnayak, Dun Li Chan, Sherissa Caren Djuniwar, Hee Ming Shan
cs.AI

Abstract

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on three VL architectures (large vision-language models, text-to-image diffusion models, and vision-language embedding models) and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm toward the applicability of multimodal vision-language models in diverse regions, and they demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.
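The abstract does not spell out how GG-EZ performs model merging. As a rough illustrative sketch only (the function name, the `alpha` parameter, and the use of plain linear interpolation are assumptions, not details from the paper), model merging is commonly implemented as a weighted average of corresponding parameters between a regionally fine-tuned model and the original global model:

```python
def merge_state_dicts(base, regional, alpha=0.5):
    """Linearly interpolate two parameter dictionaries.

    alpha=0 keeps the base (global) weights; alpha=1 keeps the
    regionally fine-tuned weights. This is a generic illustration of
    model merging, not the paper's exact GG-EZ procedure.
    """
    merged = {}
    for name, base_w in base.items():
        reg_w = regional[name]  # assumes identical architectures / key sets
        merged[name] = [(1 - alpha) * b + alpha * r
                        for b, r in zip(base_w, reg_w)]
    return merged

# Toy example with two tiny "layers" represented as flat lists:
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
regional = {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]}
print(merge_state_dicts(base, regional, alpha=0.5))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [1.0]}
```

In practice the same interpolation would be applied tensor-by-tensor over a real checkpoint (e.g. a PyTorch `state_dict`), with `alpha` tuned to trade off regional relevance against global performance.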
April 17, 2026