TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

Abstract

Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust, high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset covering diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
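The abstract does not specify how CLWF is implemented, so the following is only a minimal sketch of the general idea of reliability-dependent, per-location fusion of two modalities. The function name `fuse_locally_weighted`, the per-pixel reliability logits, and the use of a two-way softmax are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def fuse_locally_weighted(frame_feat, event_feat, frame_score, event_score):
    """Blend two H x W feature maps using per-pixel softmax weights.

    frame_score and event_score are hypothetical per-pixel reliability
    logits (assumption: in a learned model they would be predicted by a
    small network; here they are simply given as inputs).
    """
    fused = []
    for f_row, e_row, fs_row, es_row in zip(
        frame_feat, event_feat, frame_score, event_score
    ):
        out_row = []
        for f, e, fs, es in zip(f_row, e_row, fs_row, es_row):
            m = max(fs, es)            # subtract the max for numerical stability
            w_frame = math.exp(fs - m)
            w_event = math.exp(es - m)
            total = w_frame + w_event  # normalise so the two weights sum to 1
            out_row.append((w_frame * f + w_event * e) / total)
        fused.append(out_row)
    return fused


# Toy 1 x 2 feature maps: the second pixel has an unreliable frame feature
# (e.g. motion blur), expressed here as a very low frame logit.
frame_feat = [[1.0, 1.0]]
event_feat = [[3.0, 3.0]]
frame_score = [[0.0, -50.0]]
event_score = [[0.0, 0.0]]
fused = fuse_locally_weighted(frame_feat, event_feat, frame_score, event_score)
```

With equal logits the two modalities are averaged; when one logit is much lower, the fused value at that location falls back almost entirely to the other modality, which is the behaviour the abstract attributes to adaptive fusion under blur or low light.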
