

VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

August 26, 2025
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
cs.AI

Abstract

3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and maintaining overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we construct Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
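
The invert-then-replace mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `denoiser` object and its `invert_step` / `denoise_step` methods, the key-value caching interface, and the latent/mask layout are all hypothetical placeholders standing in for the structured 3D generative backbone, which the abstract does not specify.

```python
# Minimal sketch of VoxHammer-style editing with preserved-region replacement.
# Assumptions: `denoiser` is a hypothetical latent-diffusion wrapper whose
# invert_step/denoise_step expose cached key-value tokens; `preserve_mask` is a
# boolean tensor broadcastable to the 3D latent marking regions to keep unchanged.

import torch

@torch.no_grad()
def edit_with_preservation(denoiser, x0_latent, preserve_mask, edit_condition, num_steps=25):
    """Invert the source 3D latent, then re-denoise under the editing condition,
    reusing the cached inversion states wherever the region is preserved."""
    # 1) Inversion: walk the clean latent toward noise, caching the inverted
    #    latent and the attention key-value tokens at every timestep.
    inverted_latents, cached_kv = [], []
    x = x0_latent
    for t in range(num_steps):
        x, kv = denoiser.invert_step(x, t, return_kv=True)  # hypothetical API
        inverted_latents.append(x)
        cached_kv.append(kv)

    # 2) Denoising/editing: start from the fully inverted latent and denoise with
    #    the new condition, injecting cached KV tokens for preserved regions and
    #    overwriting their latent features with the corresponding inversion states.
    x = inverted_latents[-1]
    for t in reversed(range(num_steps)):
        x = denoiser.denoise_step(
            x, t,
            condition=edit_condition,
            kv_override=cached_kv[t],   # reuse cached key-value tokens
            kv_mask=preserve_mask,      # ...only inside preserved regions
        )
        # Feature replacement: preserved voxels take the inverted latent at this step.
        reference = inverted_latents[t - 1] if t > 0 else x0_latent
        x = torch.where(preserve_mask, reference, x)
    return x
```

The key design point the abstract emphasizes is that both the latents and the key-value tokens of preserved regions are reused, so unedited geometry is reconstructed consistently while the edited parts attend to genuine contextual features rather than re-generated ones.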