ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

Abstract

We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on the visual-language representations learned by video diffusion models trained on Internet-scale datasets. A key insight of our approach is to preserve as much of the generative model's original representation as possible while fine-tuning it on narrow-domain Referral Object Segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. It also generalizes to non-object dynamic concepts, such as waves crashing in the ocean, as demonstrated on our newly introduced benchmark for Referral Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets such as Ref-DAVIS, while outperforming them by up to twelve points in region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.
