

VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

August 26, 2025
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
cs.AI

Abstract

3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and maintaining overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we construct Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
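
To make the inversion-then-replacement idea concrete, below is a minimal, self-contained PyTorch sketch. Everything here is illustrative, not the authors' code: the denoiser is a toy stand-in for a pretrained structured 3D generative model, the step count and mask are arbitrary, and the key-value token caching inside attention layers is omitted. The sketch only shows how per-timestep inverted latents can be cached and written back into the preserved region at every denoising step so that unedited voxels are reconstructed consistently.

```python
# Hypothetical sketch of trajectory inversion + preserved-region latent replacement.
# Not the VoxHammer implementation; a toy model is used in place of a real 3D generator.
import torch

def toy_velocity(latent, t):
    """Stand-in for a pretrained flow/diffusion denoiser (illustrative dynamics only)."""
    return -latent * (1.0 - t)

def invert(latent_0, num_steps=10):
    """Run the sampler backwards and cache the latent at every timestep."""
    cached = [latent_0.clone()]
    latent = latent_0.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        latent = latent + toy_velocity(latent, t) * dt  # Euler step toward noise
        cached.append(latent.clone())
    return cached  # cached[k] is the inverted latent after k inversion steps

def edit(cached, preserve_mask, edit_latent_T, num_steps=10):
    """Denoise from an edited noisy latent, but at every step overwrite the
    preserved region with the corresponding cached inverted latent."""
    latent = edit_latent_T.clone()
    dt = 1.0 / num_steps
    for i in reversed(range(num_steps)):
        t = (i + 1) * dt
        latent = latent - toy_velocity(latent, t) * dt  # Euler step toward data
        # Feature replacement: keep preserved voxels on the inversion trajectory.
        latent = torch.where(preserve_mask, cached[i], latent)
    return latent

if __name__ == "__main__":
    latent_0 = torch.randn(1, 8, 16, 16, 16)           # toy structured 3D latent
    preserve_mask = torch.zeros_like(latent_0, dtype=torch.bool)
    preserve_mask[..., :8] = True                       # half the volume is preserved
    cached = invert(latent_0)
    noisy_edit = cached[-1] + 0.1 * torch.randn_like(latent_0)  # simulated edit
    edited = edit(cached, preserve_mask, noisy_edit)
    # Preserved region matches the original latent by construction.
    print(torch.allclose(edited[preserve_mask], latent_0[preserve_mask]))
```

In the described method, the same replacement principle also applies to the attention key-value tokens cached during inversion, which this toy denoiser (having no attention layers) cannot demonstrate.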