Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

April 2, 2026
Authors: Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han
cs.AI

Abstract

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
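To make the shared-sequence idea concrete, here is a minimal Python sketch of how text, image, and 3D data might be interleaved as discrete tokens in one autoregressive sequence, following the text-to-image-to-3D-to-image cycle the abstract describes. Every name below (the tokenizer stand-ins, the boundary tokens, cycle_sequence, pair_sequence) is a hypothetical placeholder for illustration, not Omni123's actual API; the abstract does not specify the tokenization scheme.

```python
# Illustrative sketch only: all names are hypothetical stand-ins, not the
# paper's implementation.

from typing import List

# Hypothetical special tokens marking modality spans in the shared
# discrete sequence space.
BOT, EOT = "<text>", "</text>"
BOI, EOI = "<img>", "</img>"
BO3, EO3 = "<3d>", "</3d>"

def text_tokens(prompt: str) -> List[str]:
    # Stand-in for a real text tokenizer.
    return [BOT] + prompt.split() + [EOT]

def image_tokens(codes: List[int]) -> List[str]:
    # Stand-in for discrete codes from an image tokenizer (e.g., a VQ model).
    return [BOI] + [f"i{c}" for c in codes] + [EOI]

def shape_tokens(codes: List[int]) -> List[str]:
    # Stand-in for discrete codes from a 3D tokenizer.
    return [BO3] + [f"s{c}" for c in codes] + [EO3]

def pair_sequence(src: List[str], tgt: List[str]) -> List[str]:
    # Any heterogeneous pair (text-image, image-3D, text-3D) forms a valid
    # training sequence on its own, so fully aligned text-image-3D triplets
    # are not required.
    return src + tgt

def cycle_sequence(prompt: str, img: List[int], shape: List[int],
                   img_back: List[int]) -> List[str]:
    # One text -> image -> 3D -> image cycle: a single autoregressive pass
    # over this sequence supervises semantic alignment (text -> image),
    # lifting (image -> 3D), and re-rendering consistency (3D -> image).
    return (text_tokens(prompt)
            + image_tokens(img)
            + shape_tokens(shape)
            + image_tokens(img_back))

if __name__ == "__main__":
    print(cycle_sequence("a red chair", [12, 7], [3, 9, 4], [12, 8]))
    print(pair_sequence(text_tokens("a red chair"), shape_tokens([3, 9, 4])))
```

Under this reading, the cycle is simply a longer training sequence: because the same next-token loss covers each modality span, the abundant 2D spans act as a geometric prior that constrains the scarcer 3D spans.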