Unleashing Vecset Diffusion Model for Fast Shape Generation

Abstract

3D shape generation has flourished through the development of so-called "native" 3D diffusion, particularly the Vecset Diffusion Model (VDM). While recent advances have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. The difficulty lies not only in accelerating diffusion sampling but also in accelerating VAE decoding in VDM, an area under-explored in previous work. To address these challenges, we present FlashVDM, a systematic framework for accelerating both the VAE and the DiT in VDM. For the DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps at comparable quality, made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For the VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of the shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving performance comparable to the state of the art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at https://github.com/Tencent/FlashVDM.
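The sparsity argument can be illustrated with a small coarse-to-fine sketch. This is not the FlashVDM implementation of Hierarchical Volume Decoding. It is a minimal toy under stated assumptions: the decoder is replaced by an analytic signed distance function (`sphere_sdf`), the volume is the cube [-1, 1]^3, and the names `hierarchical_decode`, `coarse_res`, and `factor` are hypothetical. The field is queried densely at a coarse resolution, and fine-resolution queries are spent only inside coarse cells that may contain the surface.

```python
import math
from itertools import product


def sphere_sdf(x, y, z, radius=0.5):
    """Toy stand-in for a learned decoder: signed distance to a sphere."""
    return math.sqrt(x * x + y * y + z * z) - radius


def hierarchical_decode(query, coarse_res=8, factor=4):
    """Coarse-to-fine field evaluation on the cube [-1, 1]^3.

    Returns (values, n_queries), where values maps fine-grid indices to
    field values for the fine cells inside near-surface coarse cells.
    """
    coarse_size = 2.0 / coarse_res
    fine_size = coarse_size / factor
    # Assumption: query is a true distance field, so a coarse cell can touch
    # the surface only if |value at its centre| <= half the cell diagonal.
    band = 0.5 * math.sqrt(3.0) * coarse_size

    values = {}
    n_queries = 0
    for ci, cj, ck in product(range(coarse_res), repeat=3):
        cx = -1.0 + (ci + 0.5) * coarse_size
        cy = -1.0 + (cj + 0.5) * coarse_size
        cz = -1.0 + (ck + 0.5) * coarse_size
        n_queries += 1
        if abs(query(cx, cy, cz)) > band:
            continue  # far from the surface: skip all fine queries here
        for di, dj, dk in product(range(factor), repeat=3):
            fi, fj, fk = ci * factor + di, cj * factor + dj, ck * factor + dk
            x = -1.0 + (fi + 0.5) * fine_size
            y = -1.0 + (fj + 0.5) * fine_size
            z = -1.0 + (fk + 0.5) * fine_size
            values[(fi, fj, fk)] = query(x, y, z)
            n_queries += 1
    return values, n_queries


values, n_queries = hierarchical_decode(sphere_sdf)
dense_queries = (8 * 4) ** 3  # cost of evaluating every fine cell
print(f"queries: {n_queries} hierarchical vs {dense_queries} dense")
```

Because the surface occupies only a thin shell of the volume, most coarse cells are rejected after a single query. A learned field is not an exact distance function, so a real system would need a more conservative rule for deciding which cells to refine.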