Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Abstract

Humans perceive and understand real-world spaces through a continuous stream of visual observations. The ability to maintain and update spatial evidence from potentially unbounded video streams is therefore essential for spatial intelligence. The core challenge is not simply a longer context window, but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT, an approach to streaming visual-based spatial intelligence built on test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture in which large-chunk updates run in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism that applies 3D spatiotemporal convolution to the TTT layers, encouraging the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights so that it memorizes and organizes global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
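
To make the idea of fast weights updated in large chunks more concrete, the following is a minimal illustrative sketch, not the Spatial-TTT implementation. It assumes a single linear fast-weight matrix and a squared-error key-to-value reconstruction objective, which is a common formulation of TTT layers but is not confirmed by the abstract. All names (`ttt_chunk_update`, `chunk_loss`, `run_demo`) and hyperparameters are hypothetical, and the sliding-window attention branch and the 3D spatiotemporal convolution are omitted.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def chunk_loss(W, keys, values):
    """Mean squared error of predicting each value from its key with W."""
    total = 0.0
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        total += sum((p - t) ** 2 for p, t in zip(pred, v))
    return total / len(keys)


def ttt_chunk_update(W, keys, values, lr=0.1):
    """One gradient step on the fast weights W, computed over a whole chunk.

    Assumed inner objective: mean over the chunk of ||W k - v||^2.
    Its gradient with respect to W is the mean of 2 (W k - v) k^T.
    """
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        for i in range(d_out):
            err = 2.0 * (pred[i] - v[i]) / len(keys)
            for j in range(d_in):
                grad[i][j] += err * k[j]
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
            for i in range(d_out)]


def run_demo(dim=4, chunk_size=16, num_chunks=50, seed=0):
    """Stream synthetic chunks and return (initial_loss, final_loss)."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    # Hidden linear map standing in for the structure of the scene.
    A = [rand_vec() for _ in range(dim)]

    def make_chunk(n):
        keys = [rand_vec() for _ in range(n)]
        values = [matvec(A, k) for k in keys]
        return keys, values

    probe_keys, probe_values = make_chunk(64)  # fixed probe set
    W = [[0.0] * dim for _ in range(dim)]      # fast weights start empty
    initial = chunk_loss(W, probe_keys, probe_values)
    for _ in range(num_chunks):
        keys, values = make_chunk(chunk_size)  # next chunk of the stream
        W = ttt_chunk_update(W, keys, values)  # one update per chunk
    final = chunk_loss(W, probe_keys, probe_values)
    return initial, final
```

Calling `run_demo()` returns the probe loss before and after the stream has been consumed. In this toy setting, the fast weights absorb the structure of the stream one chunk at a time, so the probe loss falls without any earlier chunk being kept in memory.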