Tracking Anything in High Quality
July 26, 2023
Authors: Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li
cs.AI
Abstract
Visual object tracking is a fundamental video task in computer vision.
Recently, the notable increase in the power of perception algorithms has
enabled the unification of single/multi-object and box/mask-based tracking. Among them, the
Segment Anything Model (SAM) attracts much attention. In this report, we
propose HQTrack, a framework for High Quality Tracking anything in videos.
HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask
refiner (MR). Given the object to be tracked in the initial frame of a video,
VMOS propagates the object masks to the current frame. The mask results at this
stage are not accurate enough, since VMOS is trained on several close-set video
object segmentation (VOS) datasets, which limits its ability to generalize to
complex and corner scenes. To further improve the quality of tracking masks, a
pretrained MR model is employed to refine the tracking results. As a compelling
testament to the effectiveness of our paradigm, without employing any tricks
such as test-time data augmentation and model ensembling, HQTrack ranks 2nd
in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code
and models are available at https://github.com/jiawen-zhu/HQTrack.
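The two-stage design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `VMOS` and `MaskRefiner` classes below are hypothetical stand-ins (the real VMOS matches frame features against a memory bank, and the real MR is a pretrained refinement model), used only to show how coarse propagated masks flow into a refinement step per frame.

```python
import numpy as np

class VMOS:
    """Hypothetical stand-in for the video multi-object segmenter (VMOS):
    given reference masks from the initial frame, it propagates each
    object's mask to the current frame."""
    def __init__(self, init_masks):
        # object id -> soft mask from the initial (annotated) frame
        self.memory = dict(init_masks)

    def propagate(self, frame):
        # A real VMOS would match frame features against its memory bank;
        # this sketch simply returns the stored masks as coarse predictions.
        return {obj_id: mask.copy() for obj_id, mask in self.memory.items()}

class MaskRefiner:
    """Hypothetical stand-in for the pretrained mask refiner (MR)."""
    def refine(self, frame, coarse_mask):
        # A real refiner sharpens object boundaries using image detail;
        # this sketch just binarizes the coarse soft mask.
        return (coarse_mask > 0.5).astype(np.uint8)

def hqtrack_step(vmos, refiner, frame):
    """One tracking step: propagate coarse masks, then refine each one."""
    coarse = vmos.propagate(frame)
    return {obj_id: refiner.refine(frame, m) for obj_id, m in coarse.items()}

# Toy usage: one 4x4 RGB frame and one object with a 2x2 soft mask.
frame = np.zeros((4, 4, 3))
init_masks = {1: np.array([[0.0, 0.9], [0.2, 0.8]])}
tracker = VMOS(init_masks)
refined = hqtrack_step(tracker, MaskRefiner(), frame)
```

The key design point the sketch preserves is the decoupling: propagation (temporal association) and refinement (mask quality) are separate modules, so the refiner can be pretrained independently and applied to whatever coarse masks the segmenter produces.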