EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert (E_{und}) and a structurally mirrored Generation Expert (E_{gen}), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01
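The coupling described above (two experts with separate weights, one shared global self-attention, hard modality routing) can be illustrated with a minimal NumPy sketch. It is not the EVA01 implementation: the single attention head, the missing normalization layers, the causal mask, the ReLU feed-forward block, and every name here (`Expert`, `routed_matmul`, `mot_layer`, `is_gen`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class Expert:
    """Private weights of one expert (hypothetical layout, not from the paper)."""
    def __init__(self, d, rng):
        s = 1.0 / np.sqrt(d)
        # Query, key, value and output projections, all d x d.
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0, s, (d, d)) for _ in range(4)
        )
        # Two-layer feed-forward block with a 4x hidden width (assumption).
        self.w1 = rng.normal(0, s, (d, 4 * d))
        self.w2 = rng.normal(0, s / 2, (4 * d, d))

def routed_matmul(x, is_gen, w_und, w_gen):
    """Hard routing: each token is multiplied by exactly one expert's matrix."""
    out = np.empty((x.shape[0], w_und.shape[1]))
    out[~is_gen] = x[~is_gen] @ w_und
    out[is_gen] = x[is_gen] @ w_gen
    return out

def mot_layer(x, is_gen, e_und, e_gen, causal=True):
    """One simplified MoT layer over a mixed-modality token sequence.

    x:      (n, d) token embeddings.
    is_gen: (n,) boolean mask, True for tokens routed to the generation expert.
    """
    n, d = x.shape
    # Expert-specific projections, selected per token by modality.
    q = routed_matmul(x, is_gen, e_und.wq, e_gen.wq)
    k = routed_matmul(x, is_gen, e_und.wk, e_gen.wk)
    v = routed_matmul(x, is_gen, e_und.wv, e_gen.wv)
    # Shared global self-attention: every token can attend across modalities.
    scores = q @ k.T / np.sqrt(d)
    if causal:  # assumption: autoregressive masking
        scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    attn = softmax(scores) @ v
    h = x + routed_matmul(attn, is_gen, e_und.wo, e_gen.wo)
    # Expert-specific feed-forward block, again hard-routed.
    hidden = np.maximum(routed_matmul(h, is_gen, e_und.w1, e_gen.w1), 0.0)
    return h + routed_matmul(hidden, is_gen, e_und.w2, e_gen.w2)

rng = np.random.default_rng(0)
d = 8
e_und, e_gen = Expert(d, rng), Expert(d, rng)
tokens = rng.normal(size=(6, d))
# Hypothetical sequence: four text tokens followed by two 3D mesh tokens.
is_gen = np.array([False, False, False, False, True, True])
out = mot_layer(tokens, is_gen, e_und, e_gen)
print(out.shape)
```

In this sketch the only place the two experts interact is the attention product over the full sequence; all weight matrices stay private to one modality, which is one way to read "structurally mirrored" experts that share global self-attention.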