VoxHammer: ネイティブ3D空間におけるトレーニング不要の精密で一貫性のある3D編集

要旨

ゲーム産業やロボットインタラクションにおいて、特定領域の3Dローカル編集は極めて重要です。最近の手法では、通常レンダリングされたマルチビュー画像を編集し、その後3Dモデルを再構築しますが、未編集領域の正確な保存と全体の一貫性の維持に課題を抱えています。構造化された3D生成モデルに着想を得て、我々はVoxHammerを提案します。これは3D潜在空間において精密かつ一貫性のある編集を実行する、新しいトレーニング不要のアプローチです。3Dモデルが与えられると、VoxHammerはまずその反転軌道を予測し、各タイムステップにおける反転潜在変数とキー・バリュートークンを取得します。その後、ノイズ除去と編集フェーズでは、保存領域のノイズ除去特徴を対応する反転潜在変数とキャッシュされたキー・バリュートークンで置き換えます。これらの文脈的特徴を保持することで、保存領域の一貫した再構築と編集部分の調和のとれた統合が保証されます。保存領域の一貫性を評価するため、我々はEdit3D-Benchを構築しました。これは数百のサンプルからなる人間によるアノテーションデータセットで、各サンプルには注意深くラベル付けされた3D編集領域が含まれています。実験の結果、VoxHammerは保存領域の3D一貫性と全体的な品質の両面において、既存の手法を大幅に上回ることが示されました。我々の手法は、高品質な編集済みペアデータの合成に有望であり、文脈内3D生成のためのデータ基盤を築くものです。プロジェクトページはhttps://huanngzh.github.io/VoxHammer-Page/をご覧ください。

English

3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.

VoxHammer: ネイティブ3D空間におけるトレーニング不要の精密で一貫性のある3D編集

VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

要旨

Support