Meta-Personalizing Vision-Language Models to Find Named Instances in Video

Abstract

Large-scale vision-language models (VLMs) have shown impressive results in language-guided search applications. While these models support category-level queries, they currently struggle with personalized searches for the moments in a video where a specific object instance, such as "My dog Biscuit", appears. We present three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., to learn how to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
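The idea of representing an instance embedding as a combination of shared category features, and then ranking video moments by similarity, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`combine`, `cosine`, `rank_moments`), the three-dimensional vectors, and the hand-picked weights are all hypothetical. In a real system the weights would be learned per instance and the query would pass through the VLM's text encoder, whereas here the vectors are compared directly.

```python
import math

def combine(weights, category_features):
    """Build an instance embedding as a weighted sum of shared feature vectors.

    Assumption: a plain linear combination; the weights are fixed by hand
    here, whereas they would be learned per instance in practice.
    """
    dim = len(category_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, category_features))
        for d in range(dim)
    ]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_moments(query, moment_embeddings):
    """Return moment indices sorted from most to least similar to the query."""
    scores = [cosine(query, m) for m in moment_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy shared category features (hypothetical, e.g. two "dog" attributes).
category_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical per-instance weights standing in for learned parameters.
weights = [0.8, 0.2]
instance = combine(weights, category_features)

# Toy embeddings of three video moments in the same space.
moments = [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranking = rank_moments(instance, moments)
print(ranking)
```

Because the instance embedding lives in the span of the shared features, only the small weight vector differs between instances, which is one plausible reading of how such a representation limits what must be learned at test time.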
