ObjectMover: Generative Object Movement with Video Prior

Abstract

Simple as it seems, moving an object to another location within an image is in fact a challenging image-editing task: it requires re-harmonizing the lighting, adjusting the pose to match the perspective, accurately filling occluded regions, and keeping shadows and reflections coherently synchronized, all while preserving the object's identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is to model this task as a sequence-to-sequence problem and fine-tune a video generation model, leveraging its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model's generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.

