EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning and operate as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, so they make only incremental adaptations without systematically analyzing how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built on a Mixture-of-Transformers (MoT) architecture, EVA01 splits the model into a pre-trained Understanding Expert (E_{und}) and a structurally mirrored Generation Expert (E_{gen}), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Experimental results show that EVA01 achieves state-of-the-art fidelity in native text-to-3D generation and enables robust long-context, multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project page: https://www.seeles.ai/research/pages/EVA01
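
To make the coupling between the two experts concrete, the following is a minimal toy sketch of one attention step with hard modality routing and shared global self-attention. It is an illustration under stated assumptions, not the EVA01 implementation: the names `make_expert` and `mot_attention`, the dimensions, the random weights, and the absence of a causal mask, feed-forward layers, and normalization are all hypothetical choices made for brevity.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(dim, rng):
    """Hypothetical expert: its own Q/K/V projection matrices (random here)."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {"q": mat(), "k": mat(), "v": mat()}

def mot_attention(tokens, modalities, experts):
    """One toy attention step over a mixed-modality sequence.

    tokens: list of vectors; modalities: the expert key for each token.
    Returns the attended outputs and the attention weights.
    """
    # Hard modality routing: each token is projected only by its own expert.
    q = [matvec(experts[m]["q"], x) for x, m in zip(tokens, modalities)]
    k = [matvec(experts[m]["k"], x) for x, m in zip(tokens, modalities)]
    v = [matvec(experts[m]["v"], x) for x, m in zip(tokens, modalities)]
    scale = 1.0 / math.sqrt(len(tokens[0]))
    out, weights = [], []
    for qi in q:
        # Shared global self-attention: every token attends to all tokens,
        # regardless of which expert produced their keys and values.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v))
                    for d in range(len(qi))])
    return out, weights

rng = random.Random(0)
dim = 4
# "und" stands in for E_{und}, "gen" for E_{gen} (assumed labels).
experts = {"und": make_expert(dim, rng), "gen": make_expert(dim, rng)}
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
modalities = ["und", "und", "und", "gen", "gen"]  # e.g. 3 text tokens, 2 mesh tokens
out, weights = mot_attention(tokens, modalities, experts)
```

In this sketch the parameters stay separate per expert while the attention weights span the whole sequence, which is the mechanism by which mesh tokens could read semantic context from text tokens without passing through an intermediate 2D representation.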