Tracking Anything in High Quality

Abstract

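Visual object tracking is one of the fundamental video tasks in computer vision. Recently, markedly improved perception algorithms have made it possible to unify single-/multi-object tracking and box-/mask-based tracking. Among these algorithms, the Segment Anything Model (SAM) has attracted much attention. In this report, we propose HQTrack, a framework for high-quality tracking of anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the objects to be tracked in the initial frame of a video, VMOS propagates their masks to the current frame. Because VMOS is trained on several closed-set video object segmentation (VOS) datasets, its ability to generalize to complex and corner-case scenes is limited, so the masks at this stage are not accurate enough. To further improve the quality of the tracking masks, a pretrained MR model refines the tracking results. As strong evidence of the effectiveness of this paradigm, HQTrack ranked 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge without using tricks such as test-time data augmentation or model ensembling. Code and models are available at https://github.com/jiawen-zhu/HQTrack.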

English

Visual object tracking is a fundamental video task in computer vision. Recently, the notably increasing power of perception algorithms has allowed the unification of single-/multi-object and box-/mask-based tracking. Among these algorithms, the Segment Anything Model (SAM) has attracted much attention. In this report, we propose HQTrack, a framework for High Quality Tracking of anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the objects to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The mask results at this stage are not accurate enough, since VMOS is trained on several closed-set video object segmentation (VOS) datasets and therefore has limited ability to generalize to complex and corner-case scenes. To further improve the quality of the tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge without employing any tricks such as test-time data augmentation or model ensembling. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
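The two-stage flow described in the abstract (propagate masks, then refine them) can be illustrated with a minimal sketch. This is not the HQTrack code: `track_video`, `copy_previous`, and `drop_dark_pixels` are hypothetical names, frames and masks are toy nested lists, and the two stages are passed in as plain callables standing in for VMOS and MR. Carrying the unrefined mask forward to the next frame is also an assumption, since the abstract does not say which mask is propagated.

```python
from typing import Callable, List, Sequence

Frame = List[List[int]]  # toy stand-in for an image (one intensity per pixel)
Mask = List[List[int]]   # per-pixel object IDs, 0 = background


def track_video(
    frames: Sequence[Frame],
    init_mask: Mask,
    propagate: Callable[[Frame, Mask], Mask],
    refine: Callable[[Frame, Mask], Mask],
) -> List[Mask]:
    """Run a propagate-then-refine loop over a video (illustrative only)."""
    if not frames:
        return []
    masks = [init_mask]  # the first frame uses the given annotation as-is
    prev = init_mask
    for frame in frames[1:]:
        coarse = propagate(frame, prev)       # stage 1: VMOS-like propagation
        masks.append(refine(frame, coarse))   # stage 2: MR-like refinement
        prev = coarse  # assumption: the unrefined mask is carried forward
    return masks


# Hypothetical toy stages, used only to make the sketch runnable.
def copy_previous(frame: Frame, prev_mask: Mask) -> Mask:
    """Pretend propagation: reuse the previous mask unchanged."""
    return [row[:] for row in prev_mask]


def drop_dark_pixels(frame: Frame, mask: Mask) -> Mask:
    """Pretend refinement: clear mask labels where the pixel value is 0."""
    return [
        [obj if pix > 0 else 0 for obj, pix in zip(mask_row, frame_row)]
        for mask_row, frame_row in zip(mask, frame)
    ]


frames = [[[1, 1], [1, 1]], [[1, 0], [1, 1]]]
init_mask = [[1, 1], [0, 2]]  # two objects, IDs 1 and 2
masks = track_video(frames, init_mask, copy_previous, drop_dark_pixels)
```

Keeping the stages as interchangeable callables mirrors the separation the abstract draws between the segmenter and the refiner, so either one could be swapped without touching the loop.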
