Sel3DCraft: Interactive Visual Prompts for User-Friendly Text-to-3D Generation

Abstract

Text-to-3D (T23D) generation has transformed digital content creation, yet it remains bottlenecked by blind trial-and-error prompting that yields unpredictable results. While visual prompt engineering has advanced in the text-to-image domain, applying it to 3D generation presents unique challenges, since it requires multi-view consistency evaluation and spatial understanding. We present Sel3DCraft, a visual prompt engineering system for T23D that transforms unstructured exploration into a guided visual process. Our approach introduces three key innovations: a dual-branch structure that combines retrieval and generation to explore diverse candidates; a multi-view hybrid scoring approach that leverages multimodal large language models (MLLMs) together with novel high-level metrics to assess 3D models in a manner consistent with human experts; and a prompt-driven visual analytics suite that enables intuitive defect identification and refinement. Extensive testing and user studies demonstrate that Sel3DCraft surpasses other T23D systems in supporting designers' creativity.
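
To make the idea of multi-view hybrid scoring concrete, the following is a minimal sketch and not the Sel3DCraft implementation. It assumes that each candidate, whether retrieved or generated, already carries one MLLM rating and one metric value per rendered view, both normalized to [0, 1], and that the two are blended with a fixed weight. The names `Candidate`, `hybrid_score`, `rank_candidates`, and `mllm_weight` are hypothetical, and the abstract does not specify the actual metrics or aggregation rule.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Candidate:
    """A 3D candidate from either branch (all fields are illustrative assumptions)."""
    name: str
    source: str                      # "retrieval" or "generation"
    mllm_view_scores: list[float]    # assumed per-view MLLM ratings in [0, 1]
    metric_view_scores: list[float]  # assumed per-view metric values in [0, 1]


def hybrid_score(candidate: Candidate, mllm_weight: float = 0.5) -> float:
    """Blend the two score types per view, then average over all views."""
    if not candidate.mllm_view_scores:
        raise ValueError("candidate has no rendered views")
    if len(candidate.mllm_view_scores) != len(candidate.metric_view_scores):
        raise ValueError("both score lists must cover the same views")
    per_view = [
        mllm_weight * m + (1.0 - mllm_weight) * k
        for m, k in zip(candidate.mllm_view_scores, candidate.metric_view_scores)
    ]
    return mean(per_view)


def rank_candidates(candidates: list[Candidate], mllm_weight: float = 0.5) -> list[Candidate]:
    """Sort candidates from both branches by descending hybrid score."""
    return sorted(candidates, key=lambda c: hybrid_score(c, mllm_weight), reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("chair_a", "retrieval", [0.8, 0.6], [0.4, 0.2]),
        Candidate("chair_b", "generation", [0.9, 0.9], [0.7, 0.7]),
    ]
    for c in rank_candidates(pool):
        print(c.name, c.source, round(hybrid_score(c), 3))
```

Averaging over views is only one possible aggregation. A stricter variant could take the minimum over views so that a single inconsistent view lowers the score of the whole model.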