Tracking Anything in High Quality
July 26, 2023
Authors: Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li
cs.AI
Abstract
Visual object tracking is a fundamental video task in computer vision. Recently, the notably increasing power of perception algorithms has allowed the unification of single/multi-object and box/mask-based tracking. Among them, the Segment Anything Model (SAM) has attracted much attention. In this report, we propose HQTrack, a framework for High Quality Tracking of anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the objects to be tracked in the initial frame of a video, VMOS propagates their masks to the current frame. The mask results at this stage are not accurate enough, since VMOS is trained on several close-set video object segmentation (VOS) datasets and therefore has limited ability to generalize to complex and corner-case scenes. To further improve the quality of the tracking masks, a pretrained MR model is employed to refine the tracking results. Without employing any tricks such as test-time data augmentation or model ensembling, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge, a compelling testament to the effectiveness of our paradigm. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
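The abstract describes a two-stage pipeline: VMOS propagates the initial-frame object masks to each new frame, and a pretrained mask refiner (MR) then sharpens those coarse masks. The sketch below illustrates only that control flow in plain Python with placeholder models; every class and function name here (VideoMultiObjectSegmenter, MaskRefiner, track_video) is a hypothetical stand-in rather than the actual HQTrack API, which lives in the repository linked above.

```python
# Minimal sketch of the propagate-then-refine pipeline described in the abstract.
# All names are hypothetical placeholders, not the real HQTrack implementation.
from typing import Dict, List
import numpy as np


class VideoMultiObjectSegmenter:
    """Stand-in for VMOS: propagates per-object masks frame by frame."""

    def __init__(self) -> None:
        self.memory: List[Dict[int, np.ndarray]] = []  # per-frame {object_id: mask}

    def initialize(self, first_frame: np.ndarray, init_masks: Dict[int, np.ndarray]) -> None:
        # Store the annotated first frame as the initial memory entry.
        self.memory = [init_masks]

    def propagate(self, frame: np.ndarray) -> Dict[int, np.ndarray]:
        # Placeholder propagation: reuse the most recent masks.
        # A real VMOS would match the current frame against memorized features.
        coarse_masks = {oid: mask.copy() for oid, mask in self.memory[-1].items()}
        self.memory.append(coarse_masks)
        return coarse_masks


class MaskRefiner:
    """Stand-in for the pretrained MR model that sharpens coarse masks."""

    def refine(self, frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
        # Placeholder refinement: simply binarize the coarse mask. A real refiner
        # would predict a higher-quality mask conditioned on the image and the mask.
        return (mask > 0.5).astype(np.uint8)


def track_video(frames: List[np.ndarray],
                init_masks: Dict[int, np.ndarray]) -> List[Dict[int, np.ndarray]]:
    """Run the two-stage loop: VMOS propagation followed by MR refinement."""
    vmos = VideoMultiObjectSegmenter()
    refiner = MaskRefiner()
    vmos.initialize(frames[0], init_masks)

    results = [init_masks]
    for frame in frames[1:]:
        coarse = vmos.propagate(frame)                                          # stage 1: VMOS
        refined = {oid: refiner.refine(frame, m) for oid, m in coarse.items()}  # stage 2: MR
        results.append(refined)
    return results


if __name__ == "__main__":
    # Tiny synthetic example: 5 blank frames, 2 objects, 64x64 masks.
    frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(5)]
    init_masks = {1: np.ones((64, 64), dtype=np.float32),
                  2: np.zeros((64, 64), dtype=np.float32)}
    masks_per_frame = track_video(frames, init_masks)
    print(len(masks_per_frame), sorted(masks_per_frame[-1].keys()))
```

Keeping propagation and refinement as separate stages mirrors the design in the abstract: the refiner is a pretrained model applied on top of VMOS outputs, so it can in principle be swapped or upgraded without retraining the segmenter.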