VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
October 15, 2025
Authors: Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
cs.AI
Abstract
The rapid progress of large, pretrained models for both visual content
generation and 3D reconstruction opens up new possibilities for text-to-3D
generation. Intuitively, one could obtain a formidable 3D scene generator if
one were able to combine the power of a modern latent text-to-video model as
"generator" with the geometric abilities of a recent (feedforward) 3D
reconstruction system as "decoder". We introduce VIST3A, a general framework
that does just that, addressing two main challenges. First, the two components
must be joined in a way that preserves the rich knowledge encoded in their
weights. We revisit model stitching, i.e., we identify the layer in the 3D
decoder that best matches the latent representation produced by the
text-to-video generator and stitch the two parts together. That operation
requires only a small dataset and no labels. Second, the text-to-video
generator must be aligned with the stitched 3D decoder, to ensure that the
generated latents are decodable into consistent, perceptually convincing 3D
scene geometry. To that end, we adapt direct reward finetuning, a popular
technique for human preference alignment. We evaluate the proposed VIST3A
approach with different video generators and 3D reconstruction models. All
tested pairings markedly improve over prior text-to-3D models that output
Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also
enables high-quality text-to-pointmap generation.
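The layer-selection step of model stitching described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: for each candidate layer of the 3D decoder, fit a linear stitching map from the generator's latents to that layer's activations on a small unlabeled dataset, then pick the layer with the lowest fitting error. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_stitch(z, h):
    """Fit a linear stitching layer W mapping latents z to activations h.

    z: (n_samples, latent_dim) latents from the video generator.
    h: (n_samples, act_dim) activations at a candidate decoder layer.
    Returns the least-squares map W and its mean squared fitting error.
    """
    W, *_ = np.linalg.lstsq(z, h, rcond=None)
    err = float(np.mean((z @ W - h) ** 2))
    return W, err

def select_stitch_layer(z, layer_acts):
    """Pick the decoder layer whose activations are best predicted from z."""
    errs = {name: fit_stitch(z, h)[1] for name, h in layer_acts.items()}
    best = min(errs, key=errs.get)
    return best, errs

# Synthetic demo: "block3" is linearly reachable from the latents,
# "block7" is unrelated noise, so the stitch should attach at block3.
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))
layer_acts = {
    "block3": z @ rng.normal(size=(16, 32)),  # linear function of z
    "block7": rng.normal(size=(64, 32)),      # independent of z
}
best, errs = select_stitch_layer(z, layer_acts)
print(best)  # → block3
```

In practice the matching criterion and the stitching layer can be richer than a linear least-squares fit, but the key property from the abstract carries over: the procedure needs only a small dataset of latent/activation pairs and no labels.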