

Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

March 25, 2025
作者: Zihang Lai, Andrea Vedaldi
cs.AI

Abstract

Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.
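To make the idea concrete, below is a minimal PyTorch sketch of a Tracktention-style layer. It follows the three steps the abstract implies: sample features along point tracks, attend along time within each track, and write the results back to the feature maps. The class name, the bilinear sampling, the per-track self-attention, and the nearest-pixel additive splatting are all simplifying assumptions for illustration, not the paper's exact attentional sampling and splatting formulation.

```python
# Illustrative sketch only; the real Tracktention Layer's sampling/splatting
# details differ. Assumes tracks come from an off-the-shelf point tracker.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TracktentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Temporal self-attention applied independently to each point track.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
        """
        feats:  (B, T, C, H, W) per-frame feature maps.
        tracks: (B, N, T, 2) point tracks, normalized (x, y) in [-1, 1].
        Returns feature maps of the same shape, updated with
        track-aligned temporal context.
        """
        B, T, C, H, W = feats.shape
        N = tracks.shape[1]

        # 1) Sample features along each track (bilinear, one point per frame).
        grid = tracks.permute(0, 2, 1, 3).reshape(B * T, N, 1, 2)
        flat = feats.reshape(B * T, C, H, W)
        sampled = F.grid_sample(flat, grid, align_corners=False)   # (B*T, C, N, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)             # (B*T, N, C)
        sampled = sampled.reshape(B, T, N, C).permute(0, 2, 1, 3)  # (B, N, T, C)

        # 2) Attend along time within each track, so information flows
        #    between corresponding points rather than fixed pixel locations.
        tokens = sampled.reshape(B * N, T, C)
        tokens = tokens + self.attn(self.norm(tokens), self.norm(tokens),
                                    self.norm(tokens), need_weights=False)[0]

        # 3) Splat updated track features back onto the feature maps
        #    (nearest-pixel additive scatter, a crude stand-in for
        #    attentional splatting).
        tokens = tokens.reshape(B, N, T, C).permute(0, 2, 1, 3)    # (B, T, N, C)
        x = ((tracks[..., 0].permute(0, 2, 1) + 1) * 0.5 * (W - 1)).round().long().clamp(0, W - 1)
        y = ((tracks[..., 1].permute(0, 2, 1) + 1) * 0.5 * (H - 1)).round().long().clamp(0, H - 1)
        out = feats.clone().reshape(B, T, C, H * W)
        idx = (y * W + x).unsqueeze(2).expand(-1, -1, C, -1)       # (B, T, C, N)
        out.scatter_add_(-1, idx, tokens.permute(0, 1, 3, 2))
        return out.reshape(B, T, C, H, W)


# Hypothetical usage: drop the layer between blocks of a frame-wise backbone.
layer = TracktentionSketch(dim=256)
feats = torch.randn(2, 8, 256, 32, 32)    # 2 clips, 8 frames of features
tracks = torch.rand(2, 64, 8, 2) * 2 - 1  # 64 tracks per clip, from a tracker
out = layer(feats, tracks)                # (2, 8, 256, 32, 32)
```

The key design point this sketch captures is that attention operates along motion trajectories instead of along a fixed pixel column over time, which is what lets such a layer stay temporally aligned under large object motion while adding little compute on top of a per-frame backbone such as a Vision Transformer.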

