Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

Abstract

Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may fail to capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video models, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.
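The abstract does not spell out the internals of the layer. Purely as an illustration of what attending along point tracks could look like, the sketch below samples per-frame features at each track position and then applies plain self-attention over time within each track. The function names, the nearest-neighbour sampling, and the lack of learned projections are all assumptions made for this example; it is not the authors' implementation.

```python
import math


def sample_along_tracks(features, tracks):
    """Look up the feature vector under each track point (nearest pixel).

    features[t][y][x] is a feature vector (list of floats) for frame t.
    tracks[n][t] is the (x, y) position of track n in frame t.
    Returns tokens[n][t]: one feature vector per track and frame.
    """
    tokens = []
    for track in tracks:
        row = []
        for t, (x, y) in enumerate(track):
            h, w = len(features[t]), len(features[t][0])
            # Clamp so points that drift off-frame still index the grid.
            xi = min(max(int(round(x)), 0), w - 1)
            yi = min(max(int(round(y)), 0), h - 1)
            row.append(list(features[t][yi][xi]))
        tokens.append(row)
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_along_track(track_tokens):
    """Single-head self-attention over time for one track (no learned weights)."""
    dim = len(track_tokens[0])
    out = []
    for q in track_tokens:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in track_tokens
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, track_tokens)) for c in range(dim)]
        )
    return out


def track_attention(features, tracks):
    """Sample features along every track, then mix them over time per track."""
    tokens = sample_along_tracks(features, tracks)
    return [attend_along_track(track_tokens) for track_tokens in tokens]


# Toy input (hypothetical): two frames of a 1x3 feature map with 2 channels.
# A single "object" feature moves from x=0 in frame 0 to x=2 in frame 1.
features = [
    [[[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]],
    [[[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]],
]
tracks = [[(0, 0), (2, 0)]]  # one track following the object
tokens = sample_along_tracks(features, tracks)
out = track_attention(features, tracks)
```

In this toy input the track follows the object across a two-pixel displacement, so the attention step compares features of the same physical point rather than features at a fixed pixel location. That is the intuition behind using motion cues for temporal alignment; a practical layer would additionally need learned projections and a way to write the updated track features back into the feature map.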

