KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

Abstract

Image ad understanding is a crucial task with wide real-world applications. Although it is highly challenging, involving diverse atypical scenes, real-world entities, and reasoning over scene text, the interpretation of image ads remains relatively under-explored, especially in the era of foundational vision-language models (VLMs), which feature impressive generalizability and adaptability. In this paper, we perform the first empirical study of image ad understanding through the lens of pre-trained VLMs. We benchmark these VLMs and reveal practical challenges in adapting them to image ad understanding. We propose a simple feature adaptation strategy that effectively fuses multimodal information for image ads and further augments it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding, which is broadly relevant to the advertising industry.
