Two Experts Are Better Than One Generalist: An Exploration of the 2Xplat Approach

Abstract

Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts a unified monolithic architecture, often built on a geometry-centric 3D foundation model, that jointly estimates camera poses and synthesizes the 3DGS representation within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation because they entangle geometric reasoning and appearance modeling in a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Although this design is conceptually simple and largely underexplored in prior work, it proves highly effective. In fewer than 5K training iterations, the two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
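
The data flow described above (poses predicted first, then explicitly handed to the Gaussian generator) can be illustrated with a minimal Python sketch. This is not the 2Xplat implementation: the names `two_expert_forward`, `identity_pose_expert`, and `one_gaussian_per_view_expert`, the `Gaussian` fields, and the 4x4 camera-to-world pose convention are all hypothetical placeholders chosen for illustration, and the two stand-in experts are trivial functions, not trained networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical placeholder types; the abstract does not specify data formats.
Image = List[List[float]]   # a tiny grayscale image as nested lists
Pose = List[List[float]]    # assumed 4x4 camera-to-world matrix


@dataclass
class Gaussian:
    """One 3D Gaussian primitive (field set is an illustrative assumption)."""
    mean: Tuple[float, float, float]
    scale: Tuple[float, float, float]
    rotation: Tuple[float, float, float, float]  # unit quaternion (w, x, y, z)
    opacity: float
    color: Tuple[float, float, float]


GeometryExpert = Callable[[Sequence[Image]], List[Pose]]
AppearanceExpert = Callable[[Sequence[Image], Sequence[Pose]], List[Gaussian]]


def two_expert_forward(
    images: Sequence[Image],
    geometry_expert: GeometryExpert,
    appearance_expert: AppearanceExpert,
) -> Tuple[List[Pose], List[Gaussian]]:
    """Run the two stages in sequence: poses first, then Gaussians."""
    # Stage 1: the geometry expert sees only the uncalibrated images.
    poses = geometry_expert(images)
    if len(poses) != len(images):
        raise ValueError("geometry expert must return one pose per image")
    # Stage 2: the appearance expert receives the images AND the predicted
    # poses as explicit inputs, instead of sharing one entangled network.
    gaussians = appearance_expert(images, poses)
    return poses, gaussians


def identity_pose_expert(images: Sequence[Image]) -> List[Pose]:
    """Stand-in geometry expert: returns an identity pose for every view."""
    return [
        [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
        for _ in images
    ]


def one_gaussian_per_view_expert(
    images: Sequence[Image], poses: Sequence[Pose]
) -> List[Gaussian]:
    """Stand-in appearance expert: places one Gaussian at each camera center."""
    return [
        Gaussian(
            mean=(p[0][3], p[1][3], p[2][3]),  # translation column of the pose
            scale=(1.0, 1.0, 1.0),
            rotation=(1.0, 0.0, 0.0, 0.0),
            opacity=1.0,
            color=(0.5, 0.5, 0.5),
        )
        for p in poses
    ]


views = [[[0.0]], [[0.0]], [[0.0]]]  # three dummy 1x1 images
poses, gaussians = two_expert_forward(
    views, identity_pose_expert, one_gaussian_per_view_expert
)
```

The point of the sketch is the interface boundary: because the appearance expert takes poses as an ordinary argument, either expert could in principle be swapped or trained separately, which is the modularity the abstract argues for.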