Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions

April 29, 2026
作者: Etai Sella, Hao Phung, Nitay Amiel, Or Litany, Or Patashnik, Hadar Averbuch-Elor
cs.AI

Abstract

Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
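The core idea of keeping edits local is that the VLM operates on a structured primitive list rather than on pixels or meshes: only the targeted primitive is modified, and every other primitive is carried through unchanged. A minimal sketch of that data flow is below; the `Primitive` schema, the `apply_edit` helper, and the edit-dictionary format are all hypothetical illustrations, not the paper's actual representation.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class Primitive:
    """One geometric primitive in the shape abstraction (hypothetical schema)."""
    name: str                         # semantic part label, e.g. "back"
    kind: str                         # primitive type, e.g. "cuboid"
    center: Tuple[float, float, float]  # (x, y, z) position
    size: Tuple[float, float, float]    # (w, h, d) extents

def apply_edit(abstraction: List[Primitive], edit: dict) -> List[Primitive]:
    """Apply one primitive-level edit (as a VLM might emit it).

    Only the targeted primitive is rebuilt with the requested changes;
    all other primitives pass through untouched, which is what lets a
    downstream generator preserve the unedited regions of the shape.
    """
    return [
        replace(p, **edit["changes"]) if p.name == edit["target"] else p
        for p in abstraction
    ]

# Toy abstraction of a chair, plus an edit derived from the instruction
# "make the backrest taller".
chair = [
    Primitive("seat", "cuboid", (0.0, 0.5, 0.0), (1.0, 0.1, 1.0)),
    Primitive("back", "cuboid", (0.0, 1.0, -0.45), (1.0, 0.9, 0.1)),
]
edit = {"target": "back", "changes": {"size": (1.0, 1.3, 0.1)}}
edited = apply_edit(chair, edit)
```

In the full pipeline, the edited primitive list would then condition a 3D generative model; this sketch only shows why identity preservation falls out naturally from editing at the primitive level.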