Rethinking Visual Intelligence: Insights from Video Pretraining

Abstract

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
