Sel3DCraft: Interactive Visual Prompts for User-Friendly Text-to-3D Generation

Abstract

Text-to-3D (T23D) generation has transformed digital content creation, yet it remains bottlenecked by blind trial-and-error prompting that yields unpredictable results. While visual prompt engineering has advanced in text-to-image domains, applying it to 3D generation presents unique challenges that require multi-view consistency evaluation and spatial understanding. We present Sel3DCraft, a visual prompt engineering system for T23D that transforms unstructured exploration into a guided visual process. Our approach introduces three key innovations: a dual-branch structure that combines retrieval and generation for diverse candidate exploration; a multi-view hybrid scoring approach that leverages multimodal large language models (MLLMs) with innovative high-level metrics to assess 3D models with consistency comparable to human experts; and a prompt-driven visual analytics suite that enables intuitive defect identification and refinement. Extensive testing and user studies demonstrate that Sel3DCraft surpasses other T23D systems in supporting creativity for designers.