Tracking Anything in High Quality

Abstract

Visual object tracking is a fundamental video task in computer vision. Recently, the notably increasing power of perception algorithms has allowed the unification of single-/multi-object and box-/mask-based tracking. Among these algorithms, the Segment Anything Model (SAM) has attracted much attention. In this report, we propose HQTrack, a framework for High Quality Tracking of anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the object to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The mask results at this stage are not accurate enough, since VMOS is trained on several closed-set video object segmentation (VOS) datasets and therefore has limited ability to generalize to complex and corner-case scenes. To further improve the quality of the tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, without employing any tricks such as test-time data augmentation or model ensembling, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
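The two-stage flow described above (propagate a coarse mask, then refine it) can be illustrated with a minimal sketch. This is not the HQTrack implementation: `track_video`, `copy_previous`, and `keep_bright` are hypothetical names, the two stages are toy stand-ins for VMOS and MR, and feeding the coarse (unrefined) mask back into the propagation stage is an assumption made only for this illustration.

```python
from typing import Callable, List, Sequence

Mask = List[List[int]]   # toy label map: 0 = background, k > 0 = object id
Frame = List[List[int]]  # toy grayscale frame

def track_video(
    frames: Sequence[Frame],
    init_mask: Mask,
    propagate: Callable[[Frame, Mask], Mask],
    refine: Callable[[Frame, Mask], Mask],
) -> List[Mask]:
    """Per frame: propagate a coarse mask (stage 1), then refine it (stage 2)."""
    results = [init_mask]   # the first-frame mask is given, not predicted
    memory = init_mask      # what the propagation stage conditions on
    for frame in frames[1:]:
        coarse = propagate(frame, memory)  # stand-in for VMOS
        fine = refine(frame, coarse)       # stand-in for MR
        results.append(fine)
        memory = coarse  # assumption: propagation keeps using its own output
    return results

# Toy stand-ins (hypothetical), only here to make the sketch runnable.
def copy_previous(frame: Frame, prev_mask: Mask) -> Mask:
    """Pretend propagation: reuse the previous mask unchanged."""
    return [row[:] for row in prev_mask]

def keep_bright(frame: Frame, mask: Mask) -> Mask:
    """Pretend refinement: drop mask pixels where the frame is dark."""
    return [
        [m if p > 0 else 0 for p, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]

if __name__ == "__main__":
    frames = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    init_mask = [[1, 1], [0, 0]]
    for mask in track_video(frames, init_mask, copy_previous, keep_bright):
        print(mask)
```

In a real system the two callables would wrap trained models; passing them in as parameters keeps the loop independent of any particular segmenter or refiner.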
