

Meta-Personalizing Vision-Language Models to Find Named Instances in Video

June 16, 2023
Authors: Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni
cs.AI

Abstract

Large-scale vision-language models (VLMs) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
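The abstract's central mechanism, representing each personalized instance token as a combination of shared, learned global category features, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: the class name, the number of global features, the embedding dimension, and the softmax mixing are all hypothetical choices made for clarity.

```python
import torch
import torch.nn as nn

class MetaPersonalizedEmbedding(nn.Module):
    """Sketch of the idea from the abstract: a new instance token embedding
    built as a weighted combination of shared global category features.
    All names and shapes here are assumptions, not the paper's actual code."""

    def __init__(self, num_global_features: int = 64, embed_dim: int = 512):
        super().__init__()
        # Shared, learned global category features, reused across instances
        # (meta-learned once, before test-time personalization).
        self.global_features = nn.Parameter(
            torch.randn(num_global_features, embed_dim)
        )
        # Per-instance mixing weights: the only parameters that would be
        # optimized when personalizing to a new instance at test time.
        self.mix_weights = nn.Parameter(torch.zeros(num_global_features))

    def forward(self) -> torch.Tensor:
        # Instance embedding = convex combination of the shared features,
        # which constrains it to capture instance-specific variation only.
        w = torch.softmax(self.mix_weights, dim=0)
        return w @ self.global_features  # shape: (embed_dim,)

# Hypothetical usage: the resulting vector would extend the VLM's token
# vocabulary (e.g., as a token like "<biscuit>") so that personalized
# queries can be scored with standard text-video similarity.
new_token_vec = MetaPersonalizedEmbedding()()  # shape: (512,)
```

In this reading, the design choice is that constraining each instance embedding to the span of shared category features keeps test-time personalization lightweight (only the mixing weights are learned per instance) and discourages the new token from absorbing features already covered by the category.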