Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Abstract

Humans perceive and understand real-world spaces through a continuous stream of visual observations. The ability to continually maintain and update spatial evidence from potentially unbounded video streams is therefore essential for spatial intelligence. The core challenge is not simply a longer context window but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT, a test-time training (TTT) approach to streaming, vision-based spatial intelligence that adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture that applies large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism, applied to the TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights so that it memorizes and organizes global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
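To make the idea of chunk-wise fast-weight adaptation concrete, the following is a minimal sketch of a generic linear test-time-training layer, not the Spatial-TTT implementation. It assumes a single linear fast-weight matrix `W`, a squared-error reconstruction objective, and one gradient step per chunk; the function names (`chunk_update`, `stream_fast_weights`), the learning rate, and the toy data are all hypothetical. The sliding-window attention branch and the 3D spatiotemporal convolution mentioned in the abstract are omitted.

```python
import numpy as np


def chunk_update(W, keys, values, lr):
    """One gradient step on the chunk's mean squared error ||keys @ W - values||^2.

    Assumed objective for illustration only; the paper's actual loss may differ.
    """
    residual = keys @ W - values                     # (n, d_out)
    grad = 2.0 * keys.T @ residual / keys.shape[0]   # (d_in, d_out)
    return W - lr * grad


def stream_fast_weights(keys, values, queries, chunk_size, lr=0.5):
    """Process a token stream chunk by chunk, updating W once per chunk.

    Each chunk is read out with the weights accumulated from earlier chunks,
    and only then used to update W, so the readout never sees its own chunk.
    """
    d_in, d_out = keys.shape[1], values.shape[1]
    W = np.zeros((d_in, d_out))                      # fast weights start empty
    outputs = []
    for start in range(0, keys.shape[0], chunk_size):
        end = start + chunk_size
        outputs.append(queries[start:end] @ W)       # read with current memory
        W = chunk_update(W, keys[start:end], values[start:end], lr)
    return np.concatenate(outputs, axis=0), W


if __name__ == "__main__":
    # Toy stream: values are a fixed linear function of the keys.
    rng = np.random.default_rng(0)
    d, T = 8, 256
    W_true = rng.normal(size=(d, d))
    keys = rng.normal(size=(T, d)) / np.sqrt(d)
    values = keys @ W_true
    out, W = stream_fast_weights(keys, values, keys, chunk_size=32)
    rel_err = np.linalg.norm(W - W_true) / np.linalg.norm(W_true)
    print(f"relative error of fast weights after streaming: {rel_err:.3f}")
```

In this sketch the fast weights act as a fixed-size memory: the cost per chunk does not grow with the length of the stream, which is the property that makes this family of layers attractive for unbounded video. Updating once per large chunk, rather than once per token, trades some granularity for fewer and better-batched updates.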
