Tracking through Containers and Occluders in the Wild

Abstract

Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal, given a video sequence, is to segment both the projected extent of the target object and the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets that supports both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that, while they can be surprisingly capable of tracking targets under certain settings of task variation, a considerable performance gap remains before we can claim that a tracking model has acquired a true notion of object permanence.
