Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

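Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.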
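
The cycle-consistency objective described above can be illustrated with a minimal sketch. This is not the released implementation: the soft Dice loss is an assumption (the abstract does not name the reconstruction loss), and `forward`, `backward`, `flip_view`, and `predict_empty` are hypothetical stand-ins for the conditional segmentation model, which in practice would also take the source and target frames as input.

```python
from typing import Callable, List

Mask = List[List[float]]  # per-pixel foreground probability in [0, 1]


def dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Soft Dice loss: 0 when the masks coincide, close to 1 when disjoint."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(sum(row) for row in pred) + sum(sum(row) for row in target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def cycle_consistency_loss(
    query_mask: Mask,
    forward: Callable[[Mask], Mask],   # hypothetical: source view -> target view
    backward: Callable[[Mask], Mask],  # hypothetical: target view -> source view
) -> float:
    """Predict the mask in the target view, project it back to the source
    view, and score how well it reconstructs the original query mask.
    Only the query mask is needed, not a target-view annotation."""
    target_mask = forward(query_mask)
    reconstructed = backward(target_mask)
    return dice_loss(reconstructed, query_mask)


# Toy stand-ins (assumptions for illustration only): the two "views" are
# mirror images of each other, so a perfect predictor flips the mask.
def flip_view(mask: Mask) -> Mask:
    return [row[::-1] for row in mask]


def predict_empty(mask: Mask) -> Mask:
    """A failed predictor that loses the object entirely."""
    return [[0.0 for _ in row] for row in mask]


query = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
]

consistent = cycle_consistency_loss(query, flip_view, flip_view)
broken = cycle_consistency_loss(query, flip_view, predict_empty)
```

In this toy setting the consistent cycle yields a loss near zero and the broken one a loss near one. Because the loss depends only on the query mask, the same quantity could in principle be minimized on unlabeled test pairs, which is the property the abstract relies on for test-time training.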

