3D-Fixup: Advancing Photo Editing with 3D Priors

Abstract

Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is specified only via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity-coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/.
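To make the idea of projecting 2D information into 3D space concrete, the following is a minimal geometric sketch, not the 3D-Fixup implementation. It assumes a pinhole camera with known intrinsics and a known per-pixel depth, lifts a pixel to a 3D point, applies a rotation about a vertical axis plus a translation, and projects the result back to the image. All function names and parameters here are hypothetical and chosen only for illustration; in the paper, the 3D guidance comes from an Image-to-3D model rather than from a given depth value.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a known depth to a camera-space 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def rotate_y(point, angle_rad, pivot):
    """Rotate a 3D point about a vertical (y) axis passing through pivot."""
    px, py, pz = pivot
    x, y, z = point[0] - px, point[1] - py, point[2] - pz
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    xr = c * x + s * z
    zr = -s * x + c * z
    return (xr + px, y + py, zr + pz)


def translate(point, offset):
    """Shift a 3D point by a 3D offset."""
    return tuple(p + o for p, o in zip(point, offset))


def project(point, fx, fy, cx, cy):
    """Project a camera-space 3D point back to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def edit_pixel(u, v, depth, intrinsics, angle_rad, offset, pivot):
    """Lift a pixel to 3D, rotate and translate it, then reproject it."""
    fx, fy, cx, cy = intrinsics
    p = unproject(u, v, depth, fx, fy, cx, cy)
    p = rotate_y(p, angle_rad, pivot)
    p = translate(p, offset)
    return project(p, fx, fy, cx, cy)
```

With intrinsics `(500, 500, 320, 240)`, calling `edit_pixel(320, 240, 2.0, (500, 500, 320, 240), 0.0, (0.2, 0.0, 0.0), (0.0, 0.0, 2.0))` moves the center pixel horizontally while keeping its row, which illustrates how a 3D translation maps to a 2D displacement that depends on depth.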