Token Bottleneck: One Token to Remember Dynamics
July 9, 2025
Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
cs.AI
Abstract
Deriving compact and temporally aware visual representations from dynamic scenes is essential for successfully executing sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with a few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling an understanding of dynamic transitions across scenes. Extensive experiments on diverse sequential tasks, including video label propagation and robot manipulation in simulated environments, demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
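
To make the squeeze-and-expand idea concrete, below is a minimal PyTorch-style sketch of the two-step pipeline described in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the module name ToBoSketch, the learnable bottleneck parameter, the hint-mask mechanism, and the MSE reconstruction objective are all hypothetical choices, and the encoder/decoder are assumed to be generic ViT-style modules operating on patch-token sequences.

```python
import torch
import torch.nn as nn


class ToBoSketch(nn.Module):
    """Hedged sketch of the ToBo squeeze-and-expand pipeline.

    Assumptions (not taken from the paper): `encoder` and `decoder` are
    ViT-style modules that map (B, N, dim) token sequences to (B, N, dim),
    inputs are already-embedded patch tokens, and the objective is a simple
    MSE reconstruction of the target patches.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module, dim: int = 768):
        super().__init__()
        self.encoder = encoder  # vision backbone producing patch tokens
        self.decoder = decoder  # lightweight predictor of the target scene
        # Learnable bottleneck token that will summarize the reference scene.
        self.bottleneck = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, ref_patches: torch.Tensor,
                tgt_patches: torch.Tensor,
                hint_mask: torch.Tensor) -> torch.Tensor:
        # ref_patches, tgt_patches: (B, N, dim) embedded patch tokens
        # hint_mask: (B, N, 1) binary mask selecting the few visible hint patches
        b = ref_patches.size(0)

        # Squeeze step: encode the reference scene together with the bottleneck
        # token, then keep only the bottleneck token as the compact summary.
        bottleneck = self.bottleneck.expand(b, -1, -1)
        enc = self.encoder(torch.cat([bottleneck, ref_patches], dim=1))
        scene_token = enc[:, :1]  # (B, 1, dim) squeezed scene representation

        # Expansion step: predict the target scene from the bottleneck token
        # plus a few visible target patches used as hints (assumption: non-hint
        # patches are simply zeroed out).
        hints = tgt_patches * hint_mask
        pred = self.decoder(torch.cat([scene_token, hints], dim=1))

        # Self-supervised objective (assumption): reconstruct all target patches.
        return nn.functional.mse_loss(pred[:, 1:], tgt_patches)
```

In this reading, only the single bottleneck token carries information from the reference scene into the prediction of the target scene, which is what forces the backbone to pack temporally relevant dynamics into that one token.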