EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning and operate as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, and they adapt existing models incrementally without systematically analyzing how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 splits the model into a pre-trained Understanding Expert (E_{und}) and a structurally mirrored Generation Expert (E_{gen}), which are coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art fidelity in native text-to-3D generation and enables robust, long-context, multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01
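The coupling of two experts through shared global self-attention with hard modality routing can be illustrated with a minimal, dependency-free sketch. It is not the EVA01 implementation. The abstract does not say which parameters are expert-specific, so this sketch assumes each expert owns its own query, key, value, and output projections, that routing is decided by a per-token modality tag, and that attention is causal. All names (`make_expert`, `mot_attention`, the tags `"und"` and `"gen"`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: its own Q/K/V/output projection matrices."""
    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return {"q": mat(), "k": mat(), "v": mat(), "o": mat()}


def mot_attention(tokens, modalities, experts, causal=True):
    """One attention layer with hard modality routing.

    Each token is projected by exactly one expert, selected by its
    modality tag (hard routing). Attention itself runs over the whole
    mixed sequence, so tokens of both modalities can attend to each other.
    """
    dim = len(tokens[0])
    q = [matvec(experts[m]["q"], x) for x, m in zip(tokens, modalities)]
    k = [matvec(experts[m]["k"], x) for x, m in zip(tokens, modalities)]
    v = [matvec(experts[m]["v"], x) for x, m in zip(tokens, modalities)]

    out = []
    for i, m in enumerate(modalities):
        # Assumption: causal masking, so token i sees positions 0..i only.
        last = i + 1 if causal else len(tokens)
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(last)
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        # The output projection is again routed to the token's own expert.
        out.append(matvec(experts[m]["o"], mixed))
    return out


rng = random.Random(0)
dim = 4
experts = {"und": make_expert(dim, rng), "gen": make_expert(dim, rng)}
# Two text tokens handled by the understanding expert, then three mesh tokens
# handled by the generation expert.
modalities = ["und", "und", "gen", "gen", "gen"]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in modalities]
out = mot_attention(tokens, modalities, experts)
```

Under these assumptions, the mesh tokens read the text context directly through attention, while the understanding expert's weights are never applied to mesh tokens. With causal masking, outputs at text positions that precede all mesh tokens do not depend on the generation expert at all.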