DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

Abstract

We present DriveGen3D, a framework for generating high-quality, highly controllable dynamic 3D driving scenes that addresses key limitations of existing methods. Current approaches to driving scene synthesis either incur prohibitive computational cost for long-horizon generation, focus exclusively on long video synthesis without any 3D representation, or are restricted to static single-scene reconstruction. Our work bridges this gap by combining accelerated long-term video generation with large-scale dynamic scene reconstruction under multimodal conditional control. DriveGen3D is a unified pipeline with two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis guided by text and Bird's-Eye-View (BEV) layouts; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatio-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to 424×800 at 12 FPS) and the corresponding dynamic 3D scenes, achieving an SSIM of 0.811 and a PSNR of 22.84 on novel view synthesis while remaining parameter-efficient.
