Feedforward 3D Editing via Text-Steerable Image-to-3D
December 15, 2025
Authors: Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
cs.AI
Abstract
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, using AI-generated 3D assets in real applications critically requires the ability to edit them easily. We present Steer3D, a feedforward method that adds text steerability to image-to-3D models, enabling generated 3D assets to be edited with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering in a single forward pass. We build a scalable data engine for automatic data generation and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D follows language instructions more faithfully and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that a new steering modality (text) can be added to a pretrained image-to-3D generative model with only 100k training examples. Project website: https://glab-caltech.github.io/steer3d/
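The abstract names two concrete mechanisms: a ControlNet-style trainable branch grafted onto a frozen image-to-3D backbone, and a two-stage recipe of flow-matching training followed by DPO. The sketch below illustrates how such pieces typically fit together in PyTorch; it is an assumption-laden outline (the names TextSteeringBranch, flow_matching_loss, and dpo_loss, the tensor shapes, and the use of a Diffusion-DPO-style log-probability proxy are illustrative guesses), not the paper's implementation.

```python
# Minimal sketch of the two mechanisms named in the abstract. All names and
# shapes here are illustrative assumptions, not the paper's actual code.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextSteeringBranch(nn.Module):
    """ControlNet-style branch: a trainable copy of the frozen backbone's
    blocks whose outputs feed back through zero-initialized projections,
    so the branch is a no-op at the start of training."""
    def __init__(self, backbone_blocks, hidden_dim, text_dim):
        super().__init__()
        self.blocks = copy.deepcopy(backbone_blocks)  # trainable copy
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.zero_projs = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in backbone_blocks]
        )
        for proj in self.zero_projs:  # zero init => identity behavior at first
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, h, text_emb):
        # h: (B, N, hidden_dim) latent tokens; text_emb: (B, text_dim)
        h = h + self.text_proj(text_emb).unsqueeze(1)  # inject the new modality
        residuals = []
        for block, proj in zip(self.blocks, self.zero_projs):
            h = block(h)
            residuals.append(proj(h))  # added to the frozen backbone's features
        return residuals

def flow_matching_loss(model, x_src, x_tgt, cond):
    """Stage 1: conditional flow matching from source latents to edited
    target latents along a linear interpolation path."""
    t = torch.rand(x_src.size(0), 1, 1, device=x_src.device)
    x_t = (1 - t) * x_src + t * x_tgt   # point on the interpolation path
    v_target = x_tgt - x_src            # constant velocity of the linear path
    v_pred = model(x_t, t.flatten(), cond)
    return F.mse_loss(v_pred, v_target)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Stage 2: DPO on preferred (w) vs. rejected (l) edits, regularized
    against the stage-1 model as the reference. For flow models, a negative
    per-sample flow-matching error is a common log-likelihood proxy
    (Diffusion-DPO style)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

Zero-initializing the output projections means the frozen backbone's behavior is unchanged when training begins, which is the ControlNet property that makes bolting a new conditioning modality onto a pretrained generator stable.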