EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert (E_{und}) and a structurally mirrored Generation Expert (E_{gen}), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01
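The coupling described in the abstract, two experts joined by shared global self-attention with hard modality routing, can be illustrated with a toy sketch. This is not the EVA01 implementation. It assumes a single attention head, no attention mask, no feed-forward sublayer, and random weights. The names `make_expert`, `mot_attention`, and the modality tags `"und"` and `"gen"` are hypothetical stand-ins for E_{und} and E_{gen}. Each token is projected only by the expert matching its modality tag, while attention scores are computed over the full mixed sequence.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(rng, dim):
    """Hypothetical expert: its own Q, K, V, and output projections."""
    def rand_matrix():
        return [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]
    return {name: rand_matrix() for name in ("q", "k", "v", "o")}


def mot_attention(tokens, modality, experts):
    """One toy Mixture-of-Transformers attention step.

    tokens:   list of equal-length vectors (the mixed multimodal sequence)
    modality: list of expert keys, one per token (hard routing decision)
    experts:  dict mapping expert key -> projection weights
    """
    dim = len(tokens[0])
    # Hard modality routing: each token uses only its own expert's weights.
    q = [matvec(experts[m]["q"], x) for x, m in zip(tokens, modality)]
    k = [matvec(experts[m]["k"], x) for x, m in zip(tokens, modality)]
    v = [matvec(experts[m]["v"], x) for x, m in zip(tokens, modality)]

    outputs = []
    for i, m in enumerate(modality):
        # Shared global self-attention: token i attends to every token,
        # regardless of which expert produced the keys and values.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        outputs.append(matvec(experts[m]["o"], mixed))
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    experts = {"und": make_expert(rng, dim), "gen": make_expert(rng, dim)}
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]
    modality = ["und", "und", "gen", "gen"]  # e.g. text tokens, then 3D tokens
    out = mot_attention(tokens, modality, experts)
    print(len(out), len(out[0]))
```

In this sketch the routing is a fixed lookup on the modality tag rather than a learned gate, which is one reading of "hard" routing. The two experts interact only through the attention step.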