BA-T: An Iterative Transformer for Two-View Bundle Adjustment

Abstract

Feed-forward models for 3D reconstruction have achieved strong performance by using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, which results in poor multi-view consistency. We address this by drawing inspiration from classical bundle adjustment (BA), which can be viewed as an iterative process that repeatedly propagates information between poses and local geometry. We propose BA-T, an iterative Transformer that implements BA-style structured updates as a reusable layer in implicit token space. Instead of relying on deep attention stacks, BA-T progressively refines its predictions from latent residuals using a single lightweight layer. Experiments demonstrate that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structured alternative to depth-heavy attention, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at https://github.com/zhangganlin/BA-T.
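The view of classical bundle adjustment as information passed back and forth between poses and local geometry can be illustrated with a deliberately small toy problem. The sketch below is not the BA-T layer and does not reflect the released code. It assumes a hypothetical 1D, two-view setting in which view 0 is fixed at the origin, view 1 has an unknown offset, and each observation is a point position minus the view offset. The names `ba_step`, `cost`, and `refine` are illustrative only. The same update is applied at every iteration, alternating a geometry update (points from poses) with a pose update (pose from points).

```python
# Toy alternating refinement in 1D with two views (illustrative assumption,
# not the BA-T model). View 0 is fixed at offset 0 to remove gauge freedom.

def ba_step(pose1, obs):
    """One reusable update: refresh points from poses, then the pose from points."""
    # Geometry update: each point is the mean of its two back-projected observations.
    points = [((o0 + 0.0) + (o1 + pose1)) / 2.0 for o0, o1 in zip(obs[0], obs[1])]
    # Pose update: offset of view 1 that best explains its observations of the points.
    pose1 = sum(g - o1 for g, o1 in zip(points, obs[1])) / len(points)
    return pose1, points


def cost(pose1, points, obs):
    """Sum of squared residuals over both views."""
    r0 = [g - o for g, o in zip(points, obs[0])]
    r1 = [g - pose1 - o for g, o in zip(points, obs[1])]
    return sum(r * r for r in r0 + r1)


def refine(obs, num_iters=10):
    """Apply the same step repeatedly, recording the residual cost after each pass."""
    pose1 = 0.0  # initial guess for the offset of view 1
    points = []
    history = []
    for _ in range(num_iters):
        pose1, points = ba_step(pose1, obs)
        history.append(cost(pose1, points, obs))
    return pose1, points, history


# Synthetic, noise-free data generated from known ground truth.
true_points = [1.0, 2.5, -0.5, 4.0]
true_pose1 = 0.8
obs = [
    [g for g in true_points],               # view 0 observes the points directly
    [g - true_pose1 for g in true_points],  # view 1 observes them shifted by its offset
]

pose1, points, history = refine(obs, num_iters=20)
```

In this toy setting each block update exactly minimizes the squared residual over its own variables, so the cost never increases from one iteration to the next and the pose estimate approaches the true offset. BA-T, as described in the abstract, replaces such hand-derived updates with a learned layer operating on tokens; the sketch only conveys the alternating, repeated structure.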