

Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models

March 24, 2026
作者: Wenyue Chen, Wenjue Chen, Peng Li, Qinghe Wang, Xu Jia, Heliang Zheng, Rongfei Jia, Yuan Liu, Ronggang Wang
cs.AI

Abstract

Recent advances in 3D generation have improved the fidelity and geometric detail of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, sometimes failing to align with user intentions or producing implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into the 3D generative process via latent hidden-state injection, enabling language-controllable back-view generation for 3D assets. We utilize a VLM-diffusion-based model, in which the VLM is responsible for semantic understanding and guidance, while the diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process and demonstrating a promising direction for future 3D generation models.
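The injection mechanism the abstract describes can be illustrated with a minimal sketch: diffusion latents act as queries in a cross-attention layer whose keys and values come from the VLM's hidden states, so semantic guidance enters the generative bridge as a residual update. All names, shapes, and the use of randomly initialized projections below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_vlm_hidden_states(latents, vlm_hidden, d_k=64, rng=None):
    """Sketch of latent hidden-state injection via cross-attention.

    latents:    (n_latents, d_lat) diffusion-bridge latent tokens (queries).
    vlm_hidden: (n_tokens, d_vlm) VLM hidden-state tokens (keys/values).
    Returns latents with a residual semantic-guidance update added.
    """
    rng = rng or np.random.default_rng(0)
    d_lat, d_vlm = latents.shape[-1], vlm_hidden.shape[-1]
    # Hypothetical learned projections; random here for demonstration only.
    Wq = rng.standard_normal((d_lat, d_k)) / np.sqrt(d_lat)
    Wk = rng.standard_normal((d_vlm, d_k)) / np.sqrt(d_vlm)
    Wv = rng.standard_normal((d_vlm, d_lat)) / np.sqrt(d_vlm)
    q = latents @ Wq                        # (n_latents, d_k)
    k = vlm_hidden @ Wk                     # (n_tokens, d_k)
    v = vlm_hidden @ Wv                     # (n_tokens, d_lat)
    attn = softmax(q @ k.T / np.sqrt(d_k))  # latents attend over VLM tokens
    return latents + attn @ v               # residual injection of semantics

# Toy usage: 16 latent tokens of width 32, guided by 8 VLM tokens of width 48.
latents = np.zeros((16, 32))
vlm_hidden = np.ones((8, 48))
guided = inject_vlm_hidden_states(latents, vlm_hidden)
```

In a real pipeline the projections would be trained, and the guided latents would condition the 3D generator that reconstructs the unobserved back view; this sketch only shows where the VLM's knowledge enters the computation.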