Meta-Personalizing Vision-Language Models to Find Named Instances in Video

Abstract

Large-scale vision-language models (VLMs) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
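As a rough illustration of two ideas in the abstract, the sketch below builds an instance embedding as a weighted combination of shared category features and then ranks video moments by cosine similarity to it. This is a toy sketch under stated assumptions, not the authors' implementation: the function names (`combine`, `cosine`, `rank_moments`), the fixed weights, and the 3-dimensional vectors are all hypothetical, and a real system would learn the weights and pass the new token through the VLM's text encoder rather than comparing vectors directly.

```python
import math

def combine(weights, category_features):
    """Weighted sum of shared category feature vectors, giving one instance embedding."""
    dim = len(category_features[0])
    return [sum(w * f[d] for w, f in zip(weights, category_features))
            for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_moments(query, moment_embeddings):
    """Return moment indices sorted by descending similarity to the query."""
    scores = [cosine(query, m) for m in moment_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy values: two shared category features and per-instance weights.
category_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
instance_weights = [0.8, 0.2]  # assumed fixed here; learned per instance in practice
instance_embedding = combine(instance_weights, category_features)

# Hypothetical embeddings of three video moments in the same space.
moments = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_moments(instance_embedding, moments)
print(ranking)
```

In this toy setup the second moment ranks first because it points in nearly the same direction as the instance embedding. Since only the per-instance weights vary while the category features are shared, each new instance needs few parameters of its own.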
