

Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

February 22, 2026
Authors: Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao
cs.AI

Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
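The cycle-consistency objective described above can be sketched in a few lines. The sketch below is illustrative only: the function names (`forward_model`, `backward_model`), the use of a soft Dice loss, and the NumPy toy masks are all assumptions, not the paper's actual implementation. The key idea it shows is that the reconstruction target is the query mask itself, so the loss requires no ground-truth annotation of the target view and can also be minimized at inference time (test-time training).

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between two (soft) binary masks in [0, 1].
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def cycle_consistency_loss(query_mask, forward_model, backward_model):
    # Forward: predict the corresponding object mask in the target view.
    target_mask = forward_model(query_mask)
    # Backward: project the predicted mask back to the source view.
    reconstructed = backward_model(target_mask)
    # Self-supervision: the round trip should reproduce the original
    # query mask, so no target-view ground truth is needed.
    return dice_loss(reconstructed, query_mask)

# Toy check with stand-in "models": a perfect cycle gives near-zero loss.
query = np.zeros((8, 8))
query[2:6, 2:6] = 1.0
identity = lambda m: m
loss = cycle_consistency_loss(query, identity, identity)
print(round(float(loss), 4))  # near 0.0 for a perfect round trip
```

In the actual framework, `forward_model` and `backward_model` would be the same conditional segmentation network applied in opposite view directions; the toy identity functions here only verify that the loss behaves as intended.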