

Feedforward 3D Editing via Text-Steerable Image-to-3D

December 15, 2025
Authors: Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
cs.AI

Abstract

Recent progress in image-to-3D generation has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the ability to edit them easily. We present a feedforward method, Steer3D, that adds text steerability to image-to-3D models, enabling editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/
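The abstract names flow matching as the first stage of the training recipe but does not spell it out. As background, a minimal sketch of a conditional flow-matching objective on a linear probability path (the function names and toy shapes here are illustrative assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, t, velocity_fn):
    """Conditional flow-matching loss along a straight path.

    x0: noise sample; x1: data sample (e.g. a 3D latent); t in [0, 1].
    The model's predicted velocity at the interpolated point x_t is
    regressed onto the constant target velocity x1 - x0.
    All names here are hypothetical placeholders, not the paper's API.
    """
    xt = (1.0 - t) * x0 + t * x1      # point on the probability path
    target_v = x1 - x0                # ground-truth velocity of the path
    pred_v = velocity_fn(xt, t)       # model prediction (stand-in)
    return float(np.mean((pred_v - target_v) ** 2))

# Toy check: a "perfect" model that outputs the true path velocity
x0 = rng.standard_normal(8)
x1 = rng.standard_normal(8)
perfect = lambda xt, t: x1 - x0
loss = flow_matching_loss(x0, x1, 0.3, perfect)
print(loss)  # → 0.0
```

In Steer3D's setting the velocity model would additionally be conditioned on the text instruction via the ControlNet-style branch; the sketch omits that conditioning for brevity.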