No Strong Assumptions Needed: Learning Visual Representations through Temporal Differences

Abstract

Progress in AI has largely been driven by methods that assume less. As compute and data grow, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of visual representation learning, where the dominant approach has shifted from supervised learning, to weakly supervised learning, to self-supervised learning, which now succeeds widely without human labels. Yet even modern self-supervised approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale, and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases and relies instead on the causal assumption that the past causes the future. TDV jointly trains an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.
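
As a rough illustration of the constraint stated in the abstract (current-frame representation plus encoded motion equals next-frame representation), the following minimal sketch computes a squared residual for one frame pair. Everything beyond that constraint is an assumption made for illustration: the linear toy encoders, the choice of a mean squared error, the decision to feed the motion encoder both frames concatenated, and the hypothetical names `linear` and `tdv_residual_loss`. It is not the authors' implementation.

```python
def linear(weights, x):
    """Apply a linear map, given as a list of rows, to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tdv_residual_loss(image_w, motion_w, frame_t, frame_next):
    """Mean squared residual of z_t + m - z_next for one frame pair.

    ASSUMPTION: both encoders are toy linear maps, and the motion
    encoder receives the two frames concatenated. The abstract only
    specifies the constraint z_t + m = z_next.
    """
    z_t = linear(image_w, frame_t)        # representation of the current frame
    z_next = linear(image_w, frame_next)  # same image encoder, next frame
    m = linear(motion_w, frame_t + frame_next)  # encoded motion (list concat)
    residual = [a + b - c for a, b, c in zip(z_t, m, z_next)]
    return sum(r * r for r in residual) / len(residual)


# Toy usage: identity image encoder; motion encoder computes x_next - x_t,
# so the constraint holds exactly and the loss is zero.
identity = [[1.0, 0.0], [0.0, 1.0]]
difference = [[-1.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, 1.0]]
loss = tdv_residual_loss(identity, difference, [1.0, 2.0], [1.5, 1.0])
```

In this toy setting the loss vanishes because the motion code is exactly the difference between the two representations; in a jointly trained model, both encoders would be updated to drive this residual down.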