LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

March 3, 2026
Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun
cs.AI

Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or by the limited effective memory of recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism that preserves uncompressed context for high-precision alignment of adjacent chunks. Remarkably, this memory architecture lets LoGeR train on 128-frame sequences yet generalize to thousands of frames at inference. Evaluated on standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.
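To make the chunk-plus-hybrid-memory pattern concrete, below is a minimal, self-contained PyTorch sketch of the inference loop the abstract describes. It is an illustration of the general idea, not the authors' implementation: the names `TTTMemory`, `ChunkEncoder`, and `reconstruct`, the toy MLP memory, the single attention layer, and the self-reconstruction write objective are all assumptions standing in for LoGeR's actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque


class TTTMemory(nn.Module):
    """Parametric memory: a small MLP whose weights are updated at test time,
    so older chunks are absorbed into the network itself (the global anchor
    against scale drift, in the paper's terms). Hypothetical interface."""

    def __init__(self, dim: int, lr: float = 1e-2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.lr = lr

    def read(self, queries: torch.Tensor) -> torch.Tensor:
        # Query the compressed global context with the current chunk's tokens.
        return self.net(queries)

    def write(self, keys: torch.Tensor, values: torch.Tensor) -> None:
        # One gradient step of test-time training on this chunk. A real system
        # would derive keys/values from distinct learned projections.
        loss = F.mse_loss(self.net(keys), values)
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p -= self.lr * g


class ChunkEncoder(nn.Module):
    """Stand-in for the bidirectional intra-chunk transformer: full attention
    over the current chunk plus whatever context it is handed."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 4)  # toy per-frame geometry head

    def forward(self, x, swa_ctx=None, mem_readout=None):
        ctx = [x]
        if swa_ctx is not None:
            ctx.append(swa_ctx)      # uncompressed tokens from recent chunks
        if mem_readout is not None:
            ctx.append(mem_readout)  # compressed global context
        kv = torch.cat(ctx, dim=1)
        tokens, _ = self.attn(x, kv, kv)
        return tokens, self.head(tokens)


def reconstruct(frames, encoder, memory, chunk=128, swa_chunks=1):
    """Chunked long-video inference: recent chunks stay uncompressed in a
    sliding window (SWA context); everything older lives in the TTT memory."""
    window = deque(maxlen=swa_chunks)
    outputs = []
    for start in range(0, frames.shape[1], chunk):
        x = frames[:, start:start + chunk]
        swa_ctx = torch.cat(list(window), dim=1) if window else None
        tokens, preds = encoder(x, swa_ctx, memory.read(x))
        memory.write(tokens.detach(), tokens.detach())  # fold chunk into global memory
        window.append(tokens.detach())                  # keep it raw for the next chunk
        outputs.append(preds)
    return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    dim = 64
    video = torch.randn(1, 512, dim)  # 512 pre-tokenized frames, batch of 1
    geom = reconstruct(video, ChunkEncoder(dim), TTTMemory(dim), chunk=128)
    print(geom.shape)  # torch.Size([1, 512, 4])
```

The split mirrors the abstract's design claim: the sliding window supplies exact, uncompressed context for aligning adjacent chunks, while each TTT write folds a chunk into a fixed-size parametric state. Per-chunk cost therefore stays constant, which is why a model trained on 128-frame windows can, in principle, run for thousands of frames.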