Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

Abstract

While text-to-3D and image-to-3D generation have received considerable attention, controllable text-to-3D generation, an important task that sits between them, remains under-explored and is the main focus of this work. To address it, we make two contributions. 1) We introduce Multi-view ControlNet (MVControl), a novel neural network architecture that enhances existing pre-trained multi-view diffusion models by integrating additional input conditions such as edge, depth, normal, and scribble maps. Our innovation is a conditioning module that controls the base diffusion model using both local and global embeddings, which are computed from the input condition images and camera poses. Once trained, MVControl can provide 3D diffusion guidance for optimization-based 3D generation. 2) We propose an efficient multi-stage 3D generation pipeline that leverages the benefits of recent large reconstruction models and the score distillation algorithm. Building on the MVControl architecture, we employ a hybrid diffusion guidance method to direct the optimization process. For efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the issue of poor geometry in 3D Gaussians and enables the direct sculpting of fine-grained geometry on the mesh. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content.
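
To make the idea of binding Gaussians to mesh triangle faces more concrete, the following is a minimal sketch, not the SuGaR implementation. It assumes that each Gaussian center is stored as fixed barycentric weights on one triangle; the function name `bind_gaussians_to_faces`, the toy mesh, and the three weights per face are all hypothetical choices made for illustration, and the abstract does not specify the actual parameterization.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def bind_gaussians_to_faces(
    vertices: Sequence[Vec3],
    faces: Sequence[Tuple[int, int, int]],
    barycentric: Sequence[Vec3],
) -> List[Vec3]:
    """Return one Gaussian center per (face, barycentric weight) pair.

    Each center is a fixed convex combination of its triangle's vertices,
    so moving the mesh vertices moves the bound Gaussians with them.
    """
    centers: List[Vec3] = []
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        for wa, wb, wc in barycentric:
            centers.append(
                tuple(wa * a[d] + wb * b[d] + wc * c[d] for d in range(3))
            )
    return centers


# Hypothetical toy mesh: a single triangle in the z = 0 plane.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]
# Illustrative weights: three Gaussians per face, each pulled toward one corner.
weights = [(0.5, 0.25, 0.25), (0.25, 0.5, 0.25), (0.25, 0.25, 0.5)]

centers = bind_gaussians_to_faces(vertices, faces, weights)

# Editing the mesh (here, lifting it by one unit in z) carries the Gaussians along.
lifted = [(x, y, z + 1.0) for x, y, z in vertices]
lifted_centers = bind_gaussians_to_faces(lifted, faces, weights)
```

Because the centers are derived from the vertices rather than stored freely, any edit to the mesh geometry is reflected in the Gaussians, which is one way to read the abstract's claim that geometry can be sculpted directly on the mesh.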
