MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Abstract

This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques, or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method takes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
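
To make the idea of attention across instances concrete, the following is a minimal sketch and not the implementation used in MIDI. It assumes that each of the N object instances is represented by T latent tokens of width D, and that a multi-instance attention layer lets the queries of every instance attend to the keys and values of all instances in the scene. The function name `multi_instance_attention`, the single-head layout, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_instance_attention(tokens, w_q, w_k, w_v):
    """Single-head attention in which every instance attends to all instances.

    tokens: array of shape (N, T, D), the latent tokens of N instances.
    w_q, w_k, w_v: arrays of shape (D, D), hypothetical projection matrices.
    Returns an array of shape (N, T, D).
    """
    n, t, d = tokens.shape
    q = tokens @ w_q                          # (N, T, D): per-instance queries
    # Keys and values are pooled over all instances (assumed design),
    # so each query sees the tokens of the whole scene.
    k = (tokens @ w_k).reshape(1, n * t, d)   # (1, N*T, D)
    v = (tokens @ w_v).reshape(1, n * t, d)   # (1, N*T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (N, T, N*T)
    weights = softmax(scores, axis=-1)
    return weights @ v                        # (N, T, D)
```

In this sketch, restricting the keys and values to an instance's own tokens would reduce the layer to ordinary per-object self-attention; pooling them across instances is what allows one object's tokens to influence another's.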
