KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

Abstract

Image ad understanding is a crucial task with wide real-world applications. It is highly challenging because it involves diverse atypical scenes, real-world entities, and reasoning over scene text. Even so, how to interpret image ads remains relatively under-explored, especially in the era of foundational vision-language models (VLMs), which feature impressive generalizability and adaptability. In this paper, we perform the first empirical study of image ad understanding through the lens of pre-trained VLMs. We benchmark these VLMs and reveal practical challenges in adapting them to image ad understanding. We propose a simple feature adaptation strategy that effectively fuses multimodal information for image ads, and we further empower it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding, which is broadly relevant to the advertising industry.
