Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

March 25, 2025
Authors: Zihang Lai, Andrea Vedaldi
cs.AI

Abstract

Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.
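
The abstract describes the mechanism only at a high level. Below is a minimal PyTorch sketch, not the authors' implementation, of what a Tracktention-style layer could look like under three assumptions drawn from the abstract: features are sampled along externally supplied point tracks, attention is applied along each track over time, and the updated track features are scattered back into the frame feature maps. All names here (`TracktentionSketch`, `feats`, `tracks`) are hypothetical, and the bilinear sampling and nearest-pixel splatting are simplifications chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TracktentionSketch(nn.Module):
    """Hypothetical Tracktention-style layer (illustration only, not the paper's code)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Temporal self-attention applied independently to each point track.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
        # feats:  (B, T, C, H, W) per-frame feature maps from the backbone.
        # tracks: (B, T, N, 2) coordinates of N tracked points per frame, in the
        #         [-1, 1] grid_sample convention (from an off-the-shelf tracker).
        B, T, C, H, W = feats.shape
        N = tracks.shape[2]

        # 1) Sample features at each track location (bilinear interpolation).
        grid = tracks.reshape(B * T, 1, N, 2)
        sampled = F.grid_sample(feats.reshape(B * T, C, H, W), grid,
                                align_corners=False)              # (B*T, C, 1, N)
        tokens = sampled.squeeze(2).transpose(1, 2)               # (B*T, N, C)
        tokens = tokens.reshape(B, T, N, C).permute(0, 2, 1, 3)
        tokens = tokens.reshape(B * N, T, C)                      # one time-sequence per track

        # 2) Attend along time within each track; every sequence follows one
        #    physical point, so temporal mixing stays aligned with its motion.
        attn_out, _ = self.temporal_attn(tokens, tokens, tokens)
        updated = self.norm(tokens + attn_out)                    # (B*N, T, C)

        # 3) Splat updated track tokens back onto the feature maps.
        #    A nearest-pixel accumulating scatter keeps the sketch simple.
        updated = updated.reshape(B, N, T, C).permute(0, 2, 1, 3) # (B, T, N, C)
        xs = ((tracks[..., 0] + 1) * 0.5 * (W - 1)).round().long().clamp(0, W - 1)
        ys = ((tracks[..., 1] + 1) * 0.5 * (H - 1)).round().long().clamp(0, H - 1)
        flat = ys * W + xs                                        # (B, T, N)
        out = feats.clone()
        for b in range(B):
            for t in range(T):
                out[b, t].view(C, H * W).index_add_(
                    1, flat[b, t], updated[b, t].t().contiguous())
        return out


# Example: 2-frame clip, 64-dim features, 16x16 maps, 8 tracks.
layer = TracktentionSketch(dim=64)
feats = torch.randn(1, 2, 64, 16, 16)
tracks = torch.rand(1, 2, 8, 2) * 2 - 1
out = layer(feats, tracks)  # (1, 2, 64, 16, 16)
```

Because each attention sequence follows a single tracked point across frames, the temporal mixing remains aligned with object motion even under large displacements, which is the property the abstract credits for the improved temporal consistency over plain temporal attention or 3D convolution.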
