Tracking Anything in High Quality

Abstract

Visual object tracking is a fundamental video task in computer vision. Recent advances in perception algorithms have made it possible to unify single- and multi-object tracking as well as box- and mask-based tracking. Among these algorithms, the Segment Anything Model (SAM) has attracted particular attention. In this report, we propose HQTrack, a framework for tracking anything in videos in high quality. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the objects to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The masks at this stage are not accurate enough, because VMOS is trained on several closed-set video object segmentation (VOS) datasets and therefore has limited ability to generalize to complex scenes and corner cases. To further improve the quality of the tracking masks, a pretrained MR model is employed to refine the tracking results. As compelling evidence of the effectiveness of our paradigm, HQTrack ranked 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge without using any tricks such as test-time data augmentation or model ensembling. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
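The two-stage flow described in the abstract (propagate a mask, then refine it) can be illustrated with a minimal sketch. This is not the HQTrack implementation: the function `track` and the callables `propagate` and `refine` are hypothetical stand-ins for VMOS and MR, and feeding the unrefined mask back into the next propagation step is an assumption that the abstract does not specify.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def track(
    frames: Iterable[Frame],
    init_mask: Mask,
    propagate: Callable[[Mask, Frame], Mask],
    refine: Callable[[Frame, Mask], Mask],
) -> Iterator[Mask]:
    """Yield one refined mask for every frame after the annotated first frame.

    `propagate` plays the role of the segmenter (VMOS in the abstract) and
    `refine` the role of the mask refiner (MR). Both are hypothetical
    callables supplied by the caller.
    """
    frame_iter = iter(frames)
    next(frame_iter, None)  # the first frame already has a mask, so skip it
    prev_mask = init_mask
    for frame in frame_iter:
        # Stage 1: carry the previous mask forward to the current frame.
        coarse = propagate(prev_mask, frame)
        # Assumption: the unrefined mask drives the next propagation step.
        prev_mask = coarse
        # Stage 2: refine the propagated mask and emit it as the output.
        yield refine(frame, coarse)
```

With toy stand-ins, for example masks as sets of pixel indices and frames as integers, `list(track(frames, init_mask, propagate, refine))` returns one mask per frame after the first. In a real system the two callables would wrap the trained models.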
