VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

Abstract

3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they struggle to preserve unedited regions precisely and to maintain overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
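The two-phase procedure described above (invert and cache, then denoise while restoring preserved regions) can be illustrated with a toy sketch. This is not the VoxHammer implementation. The single-head attention "denoiser", the random weights `W_Q`, `W_K`, `W_V`, the Euler update, the scalar conditioning values, and the pairing of cache index with denoising step are all assumptions made for illustration. Only the overall pattern follows the abstract: cache latents and key-value tokens at every inversion step, then overwrite the preserved tokens with the cached values during editing.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
# Hypothetical frozen projection weights standing in for a pretrained model.
W_Q, W_K, W_V = (rng.normal(scale=0.5, size=(DIM, DIM)) for _ in range(3))

def project_kv(x):
    """Key and value tokens for latent tokens x of shape (num_tokens, DIM)."""
    return x @ W_K, x @ W_V

def velocity(x, cond, kv_override=None, keep_mask=None):
    """Toy single-head self-attention update direction (illustrative only)."""
    q = (x + cond) @ W_Q
    k, v = project_kv(x)
    if kv_override is not None:
        # Preserved tokens contribute their cached keys/values as context.
        k_cached, v_cached = kv_override
        k = np.where(keep_mask[:, None], k_cached, k)
        v = np.where(keep_mask[:, None], v_cached, v)
    scores = q @ k.T / np.sqrt(DIM)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

def invert(x0, cond, num_steps):
    """Walk from clean latents toward noise, caching latents and KV per level."""
    latents, kv_cache = [], []
    x = x0.copy()
    for i in range(num_steps + 1):
        latents.append(x.copy())
        kv_cache.append(project_kv(x))
        if i < num_steps:
            x = x + velocity(x, cond) / num_steps
    return latents, kv_cache

def edit(latents, kv_cache, keep_mask, new_cond, num_steps):
    """Denoise under a new condition while restoring preserved tokens."""
    x = latents[num_steps].copy()
    for i in range(num_steps, 0, -1):
        x = x - velocity(x, new_cond, kv_cache[i], keep_mask) / num_steps
        # Replace preserved-region latents with the cached inverted latents.
        x = np.where(keep_mask[:, None], latents[i - 1], x)
    return x

num_steps = 10
x0 = rng.normal(size=(6, DIM))                    # six toy latent tokens
keep_mask = np.array([True, True, True, True, False, False])
latents, kv_cache = invert(x0, cond=0.0, num_steps=num_steps)
edited = edit(latents, kv_cache, keep_mask, new_cond=1.0, num_steps=num_steps)
```

In this sketch the preserved tokens of `edited` equal those of `x0` exactly, because the last denoising step copies them from the cached level-0 latents, while the two unmasked tokens evolve under the new condition and attend to the cached keys and values of the preserved ones.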
