Mosaic-SDF for 3D Generative Models
December 14, 2023
Authors: Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, Yaron Lipman
cs.AI
Abstract
Current diffusion- or flow-based generative models for 3D shapes divide into two categories: those that distill pre-trained 2D image diffusion models, and those trained directly on 3D shapes. When training a diffusion or flow model on 3D shapes, a crucial design choice is the shape representation. An effective shape representation needs to adhere to three design principles: it should allow efficient conversion of large 3D datasets to the representation form; it should provide a good tradeoff between approximation power and number of parameters; and it should have a simple tensorial form that is compatible with existing powerful neural architectures. While standard 3D shape representations such as volumetric grids and point clouds do not adhere to all these principles simultaneously, we advocate in this paper a new representation that does. We introduce Mosaic-SDF (M-SDF): a simple 3D shape representation that approximates the Signed Distance Function (SDF) of a given shape using a set of local grids spread near the shape's boundary. The M-SDF representation is fast to compute for each shape individually, making it readily parallelizable; it is parameter efficient, as it only covers the space around the shape's boundary; and it has a simple matrix form, compatible with Transformer-based architectures. We demonstrate the efficacy of the M-SDF representation by using it to train a 3D generative flow model, including class-conditioned generation with the 3D Warehouse dataset and text-to-3D generation using a dataset of about 600k caption-shape pairs.
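
To make the "set of local grids, packed into a matrix" idea concrete, here is a minimal sketch of how such a representation could be assembled. The grid count N_GRIDS, the lattice resolution K, and the helper make_msdf are illustrative assumptions, not names from the paper; the only structure taken from the abstract is that each shape becomes a set of local SDF grids near the boundary with a simple matrix form.

import numpy as np

# Hypothetical sizes; the abstract does not fix them.
N_GRIDS = 1024   # number of local grids tiling the boundary (assumption)
K = 7            # resolution of each local grid (assumption)

def make_msdf(centers, scales, sdf):
    """Pack an M-SDF-style shape representation into a single matrix.

    Each of the N local grids is described by its center (3 values),
    its scale (1 value), and the SDF sampled on a K x K x K lattice
    around that center (K**3 values), giving an N x (4 + K**3) matrix
    that Transformer-style architectures can consume as a set of tokens.
    """
    rows = []
    for i in range(centers.shape[0]):
        # Sample the ground-truth SDF on the i-th local lattice.
        lin = np.linspace(-1.0, 1.0, K) * scales[i]
        xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
        pts = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3) + centers[i]
        rows.append(np.concatenate([centers[i], [scales[i]], sdf(pts)]))
    return np.stack(rows)  # shape: (N, 4 + K**3)

# Toy SDF of a unit sphere, as a stand-in for a mesh-derived SDF.
sphere_sdf = lambda p: np.linalg.norm(p, axis=-1) - 1.0
centers = np.random.randn(N_GRIDS, 3)
centers /= np.linalg.norm(centers, axis=-1, keepdims=True)  # on the boundary
scales = np.full(N_GRIDS, 0.1)
msdf = make_msdf(centers, scales, sphere_sdf)
print(msdf.shape)  # (1024, 347) with K = 7

Because every shape becomes a fixed-size matrix, converting a dataset is embarrassingly parallel: each shape's grids are fit independently, matching the abstract's claim that the representation is fast to compute per shape.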
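The abstract also states that a generative flow model is trained on these matrices. Below is a minimal flow-matching training step under common assumptions: the linear interpolation path between Gaussian noise and data, and a placeholder velocity network v_theta that takes (x, t). These choices are a standard flow-matching setup, not details confirmed by the abstract.

import torch

def flow_matching_step(v_theta, x1, optimizer):
    """One flow-matching step on a batch of M-SDF matrices x1: (B, N, D).

    Uses the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose target velocity along the path is simply x1 - x0.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # noise endpoint of the path
    t = torch.rand(b, 1, 1, device=x1.device)  # per-example time in [0, 1)
    xt = (1 - t) * x0 + t * x1                 # point on the path
    target = x1 - x0                           # regression target: velocity
    loss = ((v_theta(xt, t.flatten()) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Since each M-SDF matrix is a set of per-grid tokens, v_theta can plausibly be a Transformer over the N rows, which is the compatibility the abstract highlights.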