Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

Abstract

To what extent do vision-and-language foundation models possess a realistic world model (observation × action → observation) and a dynamics model (observation × observation → action) when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. First, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective in which image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Second, the dynamics model can assign rewards to multiple samples from the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies on the task of action-centric image editing on Aurora-Bench. Our best model achieves performance competitive with state-of-the-art image editing models, improving on them by a margin of 15% on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
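
The two strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the weighted objective or the search procedure, so the code below assumes a normalised importance-weighted negative log-likelihood and a simple best-of-n selection. The names `weighted_token_nll`, `best_of_n`, `sample_world_model`, and `dynamics_log_prob` are hypothetical placeholders for whatever models are actually used.

```python
from typing import Callable, Sequence, TypeVar

Obs = TypeVar("Obs")  # stands in for an observation, e.g. an image


def weighted_token_nll(
    token_log_probs: Sequence[float],
    importance: Sequence[float],
) -> float:
    """Importance-weighted negative log-likelihood over image tokens.

    token_log_probs[i] is the world model's log-probability of target
    token i. importance[i] >= 0 is an assumed per-token weight, for
    example one produced by a recognition model. The weights are
    normalised by their sum (an assumption of this sketch).
    """
    if len(token_log_probs) != len(importance):
        raise ValueError("one weight per token is required")
    total = sum(importance)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    weighted = sum(w * lp for w, lp in zip(importance, token_log_probs))
    return -weighted / total


def best_of_n(
    source: Obs,
    action: str,
    sample_world_model: Callable[[Obs, str], Obs],
    dynamics_log_prob: Callable[[Obs, Obs, str], float],
    n: int = 4,
) -> Obs:
    """Sample n candidate observations and keep the best-scored one.

    dynamics_log_prob(source, candidate, action) plays the role of the
    verifier: it rewards candidates for which the dynamics model finds
    the requested action plausible given the observation pair.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_world_model(source, action) for _ in range(n)]
    return max(candidates, key=lambda c: dynamics_log_prob(source, c, action))
```

In this sketch, uniform weights reduce `weighted_token_nll` to the ordinary mean token loss, and `n = 1` reduces `best_of_n` to plain sampling, so both functions can be read as extensions of a standard training and decoding setup.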

