Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models

Abstract

Recent advances in 3D generation have improved the fidelity and geometric detail of synthesized 3D assets. However, single-view observations are inherently ambiguous, and limited 3D training data leaves existing models without robust global structural priors. As a result, the unseen regions these models generate are often stochastic and difficult to control; they may fail to align with user intentions or exhibit implausible geometry. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into the 3D generative process via latent hidden-state injection, enabling language-controllable generation of the back view of 3D assets. We use a VLM-diffusion-based model in which the vision-language model (VLM) is responsible for semantic understanding and guidance, while the diffusion model acts as a bridge that transfers the VLM's semantic knowledge to the 3D generation model. In this way, we bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, turning the traditionally stochastic back-view hallucination into a semantically controllable process and pointing to a promising direction for future 3D generation models.
