KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

Abstract

Image ad understanding is a crucial task with wide real-world applications. Although it is highly challenging, involving diverse atypical scenes, real-world entities, and reasoning over scene text, how to interpret image ads remains relatively under-explored, especially in the era of foundational vision-language models (VLMs) featuring impressive generalizability and adaptability. In this paper, we perform the first empirical study of image ad understanding through the lens of pre-trained VLMs. We benchmark these VLMs and reveal practical challenges in adapting them to image ad understanding. We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads and further empower it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding, which is broadly relevant to the advertising industry.
