EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning and operate as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, and they make incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built on a Mixture-of-Transformers (MoT) architecture, EVA01 splits the model into a pre-trained Understanding Expert (E_{und}) and a structurally mirrored Generation Expert (E_{gen}), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and enables robust long-context, multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings also offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01
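
To make the coupling of E_{und} and E_{gen} concrete, the following is a minimal sketch of one Mixture-of-Transformers layer with hard modality routing and shared global self-attention. It is not the authors' implementation. The names `Expert` and `mot_layer`, the single attention head, the causal mask, the one-matrix feed-forward step, and the absence of normalization are all illustrative assumptions; the abstract specifies none of them.

```python
import math
import random


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """Hypothetical container for one expert's private weights."""

    def __init__(self, rng, d):
        self.Wq = rand_matrix(rng, d, d)
        self.Wk = rand_matrix(rng, d, d)
        self.Wv = rand_matrix(rng, d, d)
        self.Wo = rand_matrix(rng, d, d)
        self.Wf = rand_matrix(rng, d, d)  # stand-in for a feed-forward block


def mot_layer(tokens, modality, experts):
    """tokens: list of d-dim vectors; modality: 'und' or 'gen' per token."""
    d = len(tokens[0])
    # Hard routing: each token uses only the weights of its own expert.
    q = [matvec(experts[m].Wq, x) for x, m in zip(tokens, modality)]
    k = [matvec(experts[m].Wk, x) for x, m in zip(tokens, modality)]
    v = [matvec(experts[m].Wv, x) for x, m in zip(tokens, modality)]
    out = []
    for i, m in enumerate(modality):
        # Shared global attention: token i attends over the whole mixed
        # sequence up to position i (causal mask is an assumption here).
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        ctx = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(d)]
        attn = matvec(experts[m].Wo, ctx)
        h = [x + a for x, a in zip(tokens[i], attn)]  # residual
        ffn = [max(0.0, y) for y in matvec(experts[m].Wf, h)]
        out.append([x + y for x, y in zip(h, ffn)])  # residual
    return out


rng = random.Random(0)
d = 4
experts = {"und": Expert(rng, d), "gen": Expert(rng, d)}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
modality = ["und", "und", "gen", "gen"]  # e.g. text prompt, then mesh tokens
out = mot_layer(tokens, modality, experts)
```

In this sketch the two experts never share parameters. They interact only through the attention step, where mesh-token queries can read keys and values produced by the understanding expert. Under these assumptions, that is one way priors from the pre-trained backbone could reach the generation path without an intermediate 2D representation.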