No Strong Assumptions Required: Visual Representation Learning via Temporal Difference

Abstract

Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This trend is especially visible in visual representation learning, where the field has moved from supervised learning, to weakly supervised learning, to the now widespread success of self-supervised learning without human labels. Yet even modern self-supervised methods still depend on strong inductive biases such as augmentations, masking, or cropping. If the trend holds, even these remaining biases should become bottlenecks at scale, and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases and relies instead on the causal assumption that the past causes the future. TDV jointly trains an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.
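
The training constraint described in the abstract can be written schematically. The following is only a sketch under assumptions the abstract does not state: the symbols $f_\theta$ (image encoder), $g_\phi$ (motion encoder), and $x_t$ (video frame at time $t$) are illustrative names, the motion encoder is assumed to take the pair of consecutive frames as input, and the squared Euclidean distance is an assumed choice of discrepancy. The actual loss and architecture may differ.

```latex
\begin{align}
  z_t &= f_\theta(x_t), \qquad z_{t+1} = f_\theta(x_{t+1}), \\
  m_t &= g_\phi(x_t, x_{t+1}), \\
  \mathcal{L}(\theta, \phi) &= \mathbb{E}_t \left\| z_t + m_t - z_{t+1} \right\|_2^2 .
\end{align}
```

Under this reading, the objective is minimized when $z_t + m_t = z_{t+1}$, that is, when the encoded motion equals the difference between consecutive frame representations, $m_t = z_{t+1} - z_t$, which is consistent with the name "Temporal Difference".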