Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
March 24, 2026
Authors: Wenyue Chen, Wenjue Chen, Peng Li, Qinghe Wang, Xu Jia, Heliang Zheng, Rongfei Jia, Yuan Liu, Ronggang Wang
cs.AI
Abstract
Recent advances in 3D generation have substantially improved the fidelity and geometric detail of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the weak global structural priors caused by limited 3D training data, the unseen regions produced by existing models are often stochastic and difficult to control, and may fail to align with user intent or yield implausible geometry. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into the 3D generative process via latent hidden-state injection, enabling language-controllable generation of back views of 3D assets. We adopt a VLM-diffusion architecture in which the VLM handles semantic understanding and guidance, while the diffusion model acts as a bridge that transfers the VLM's semantic knowledge to the 3D generation model. In this way, we bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process and pointing to a promising direction for future 3D generation models.
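The abstract names latent hidden-state injection as the mechanism coupling the VLM to the diffusion model, but this page gives no implementation details. Below is a minimal PyTorch sketch of that general pattern: hidden states from a VLM are projected into a conditioning space and injected into a diffusion block through cross-attention. All module names (HiddenStateProjector, InjectionCrossAttention), dimensions, and the random tensors standing in for real VLM outputs are hypothetical assumptions, not the authors' implementation.

```python
# Illustrative sketch of latent hidden-state injection; NOT the Know3D code.
# All names and dimensions here are hypothetical placeholders.
import torch
import torch.nn as nn

class HiddenStateProjector(nn.Module):
    """Maps VLM hidden states into the diffusion model's conditioning space."""
    def __init__(self, vlm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, vlm_dim) -> (batch, seq_len, cond_dim)
        return self.proj(vlm_hidden)

class InjectionCrossAttention(nn.Module):
    """One diffusion block attending over the projected VLM hidden states."""
    def __init__(self, latent_dim: int, cond_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, n_heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latents: (batch, n_tokens, latent_dim)  diffusion latents
        # cond:    (batch, seq_len, cond_dim)     projected VLM knowledge
        attended, _ = self.attn(self.norm(latents), cond, cond)
        return latents + attended  # residual injection keeps the base path intact

# Toy usage: random tensors stand in for real VLM outputs and diffusion latents.
vlm_dim, cond_dim, latent_dim = 4096, 1024, 1024
projector = HiddenStateProjector(vlm_dim, cond_dim)
block = InjectionCrossAttention(latent_dim, cond_dim)

vlm_hidden = torch.randn(2, 77, vlm_dim)    # e.g. last-layer VLM hidden states
latents = torch.randn(2, 256, latent_dim)   # latents for the unseen back view
latents = block(latents, projector(vlm_hidden))
print(latents.shape)  # torch.Size([2, 256, 1024])
```

In a full system of the kind the abstract describes, such projected states would presumably condition every cross-attention layer of the diffusion backbone, whose outputs would in turn guide the 3D generation model; the residual form used here simply keeps the injection non-destructive when the conditioning signal is uninformative.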