Meta-Personalizing Vision-Language Models to Find Named Instances in Video
June 16, 2023
Authors: Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni
cs.AI
Abstract
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
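The first contribution, representing each new instance token as a combination of shared, learned global category features, could be realized roughly as in the PyTorch sketch below. This is one reading of the abstract's description, not the authors' implementation; the class name, feature counts, embedding dimension, and the softmax normalization of the mixing weights are all assumptions.

```python
import torch
import torch.nn as nn

class PersonalizedVocabulary(nn.Module):
    """Sketch: extend a VLM's token vocabulary with per-instance embeddings,
    each expressed as a weighted combination of shared global category
    features (hypothetical design; dimensions are assumptions)."""

    def __init__(self, num_instances: int, num_global: int = 64, embed_dim: int = 512):
        super().__init__()
        # Shared, learned global category features, common to all instances.
        self.global_features = nn.Parameter(torch.randn(num_global, embed_dim) * 0.02)
        # Per-instance mixing weights, learned when personalizing to an instance.
        self.mixing = nn.Parameter(torch.zeros(num_instances, num_global))

    def forward(self, instance_ids: torch.Tensor) -> torch.Tensor:
        # Each new "word" embedding is a normalized weighted sum of the shared
        # features, so per-instance parameters capture only instance-specific
        # variation on top of them.
        w = torch.softmax(self.mixing[instance_ids], dim=-1)  # (B, num_global)
        return w @ self.global_features                        # (B, embed_dim)
```

In such a design, only the small per-instance weight vector would need to be learned at test time, which is consistent with the abstract's goal of learning how to personalize quickly for new instances.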
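The second contribution, automatically finding moments of named instances using transcripts and vision-language similarity, might be approximated as follows. The choice of CLIP as the VLM, the prompt template, and the similarity threshold are placeholders; only the overall recipe (a transcript mention of the name, filtered by image-text similarity in the embedding space) follows the abstract.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def mine_named_moments(frames, timestamps, transcript_segments, name, threshold=0.25):
    """Return (timestamp, similarity) pairs where `name` is mentioned in the
    transcript AND the frame is visually similar to a text description of it.
    `frames` are PIL images; `transcript_segments` are (start, end, text) tuples.
    The prompt and threshold are assumptions, not values from the paper."""
    text = clip.tokenize([f"a photo of {name}"]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    moments = []
    for frame, t in zip(frames, timestamps):
        # Keep only frames whose timestamp falls inside a transcript segment
        # that actually mentions the instance's name.
        mentioned = any(s <= t <= e and name.lower() in txt.lower()
                        for s, e, txt in transcript_segments)
        if not mentioned:
            continue
        with torch.no_grad():
            img = preprocess(frame).unsqueeze(0).to(device)
            img_feat = model.encode_image(img)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            sim = (img_feat @ text_feat.T).item()
        # Accept the moment only if vision-language similarity is high enough.
        if sim > threshold:
            moments.append((t, sim))
    return moments
```

Moments mined this way could then serve as weak supervision for learning the instance embeddings above, without any explicit human labels.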