BA-T: An Iterative Transformer for Two-View Bundle Adjustment

Abstract

Feed-forward models for 3D reconstruction have achieved strong performance by using deep cross-view attention to exchange information across images. However, these approaches typically depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, which results in poor multi-view consistency. We draw inspiration from classical bundle adjustment (BA), which can be viewed as an iterative process of information propagation between poses and local geometry, and propose BA-T, an iterative Transformer that implements BA-style structured updates as a repeatable layer in implicit token space. Instead of relying on deep attention stacks, BA-T refines its predictions step by step from latent residuals using a single lightweight layer. Experiments show that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structured alternative to deep attention stacks, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at https://github.com/zhangganlin/BA-T.
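To make the idea of a repeatable refinement layer concrete, the following is a minimal sketch of weight-shared iterative updates that alternate between pose tokens and geometry tokens. It is not the BA-T layer: the mean-pooled messages, the tanh update, and every name in it (`init_weights`, `refine_once`, `geo_to_pose`, `pose_to_geo`, the token counts) are assumptions chosen only for illustration. It shows one property stated in the abstract, namely that repeating a single layer adds iterations without adding parameters.

```python
import math
import random


def init_weights(dim, rng):
    # Hypothetical dim x dim weight matrix with small random entries.
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]


def mean_token(tokens):
    # Average a list of tokens (each a list of floats) into one summary vector.
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def message(summary, weights):
    # Compute tanh(summary @ weights), the residual update sent to the receivers.
    dim = len(weights[0])
    return [
        math.tanh(sum(s * row[j] for s, row in zip(summary, weights)))
        for j in range(dim)
    ]


def add_to_all(tokens, delta):
    # Residual update: add the same delta vector to every token.
    return [[t + d for t, d in zip(tok, delta)] for tok in tokens]


def refine_once(pose, geo, params):
    # Pose tokens first read a summary of the geometry tokens (assumed pooling).
    pose = add_to_all(pose, message(mean_token(geo), params["geo_to_pose"]))
    # Geometry tokens then read a summary of the freshly updated pose tokens.
    geo = add_to_all(geo, message(mean_token(pose), params["pose_to_geo"]))
    return pose, geo


def refine(pose, geo, params, num_iters):
    # The same parameters are reused at every iteration.
    for _ in range(num_iters):
        pose, geo = refine_once(pose, geo, params)
    return pose, geo


def num_params(params):
    return sum(len(w) * len(w[0]) for w in params.values())


rng = random.Random(0)
dim = 4
params = {
    "geo_to_pose": init_weights(dim, rng),
    "pose_to_geo": init_weights(dim, rng),
}
pose = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]  # two views
geo = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]   # toy geometry tokens

pose_out, geo_out = refine(pose, geo, params, num_iters=4)
print(len(pose_out), len(geo_out), num_params(params))
```

In this sketch the parameter count depends only on `dim`, not on `num_iters`, whereas a stack of four independent layers of the same form would hold four times as many weights.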