Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

Abstract

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce **Anthropogenic Regional Adaptation**, a novel paradigm that aims to optimize model relevance to specific regional contexts while retaining global generalization capabilities. Second, we present a simple but effective adaptation method named **Geographical-generalization-made-easy (GG-EZ)**, which uses regional data filtering and model merging. Through comprehensive experiments on three VL architectures (large vision-language models, text-to-image diffusion models, and vision-language embedding models) and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and occasionally even surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards the applicability of multimodal vision-language models in diverse regions, and they demonstrate a simple yet effective baseline method that optimizes regional value alignment while preserving global generalization.
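The abstract names regional data filtering and model merging as the two ingredients of GG-EZ but does not specify either step. The sketch below is only a rough illustration under stated assumptions: filtering is taken to mean keeping samples tagged with a target region, and merging is taken to mean linear interpolation between the weights of a globally trained model and a regionally fine-tuned one. The function names, the `region` field, and the `alpha` parameter are all hypothetical, and the method actually used in the paper may differ.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for real model weights: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def filter_regional(samples: List[dict], region_tags: Set[str]) -> List[dict]:
    """Keep only samples whose (assumed) 'region' field is one of region_tags."""
    return [s for s in samples if s.get("region") in region_tags]


def merge_weights(global_w: StateDict, regional_w: StateDict, alpha: float) -> StateDict:
    """Linearly interpolate per parameter: (1 - alpha) * global + alpha * regional.

    This is one common merging scheme, assumed here for illustration only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if global_w.keys() != regional_w.keys():
        raise ValueError("models must share the same parameter names")
    merged: StateDict = {}
    for name, g in global_w.items():
        r = regional_w[name]
        if len(g) != len(r):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * gv + alpha * rv for gv, rv in zip(g, r)]
    return merged
```

In this sketch, `alpha = 0` returns the global model unchanged and `alpha = 1` returns the regional model, so intermediate values trade regional relevance against retained global behavior, which is the balance the abstract describes.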
