Token Bottleneck: One Token to Remember Dynamics
July 9, 2025
Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
cs.AI
Abstract
Deriving compact and temporally aware visual representations from dynamic
scenes is essential for successful execution of sequential scene understanding
tasks such as visual tracking and robotic manipulation. In this paper, we
introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised
learning pipeline that squeezes a scene into a bottleneck token and predicts
the subsequent scene using minimal patches as hints. The ToBo pipeline
facilitates the learning of sequential scene representations by conservatively
encoding the reference scene into a compact bottleneck token during the squeeze
step. In the expansion step, we guide the model to capture temporal dynamics by
predicting the target scene using the bottleneck token along with a few target
patches as hints. This design encourages the vision backbone to embed temporal
dependencies, thereby enabling understanding of dynamic transitions across
scenes. Extensive experiments in diverse sequential tasks, including video
label propagation and robot manipulation in simulated environments, demonstrate
the superiority of ToBo over baselines. Moreover, deploying our pre-trained
model on physical robots confirms its robustness and effectiveness in
real-world environments. We further validate the scalability of ToBo across
different model scales.
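The squeeze-and-expand pipeline described above can be sketched as follows. This is a minimal illustration based only on the abstract: the module sizes, the learnable `bottleneck` and `mask_token` parameters, the number of hint patches (`n_hints`), and the use of plain transformer encoders for both steps are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ToBoSketch(nn.Module):
    """Illustrative sketch: squeeze a reference scene into one bottleneck
    token, then predict the target scene from that token plus a few
    visible target patches. Details are assumptions, not the paper's design."""

    def __init__(self, dim=64, n_patches=16, n_hints=2):
        super().__init__()
        self.n_patches, self.n_hints = n_patches, n_hints
        # learnable bottleneck token appended to the reference patches
        self.bottleneck = nn.Parameter(torch.zeros(1, 1, dim))
        # placeholder token standing in for hidden target patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128,
                                       batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128,
                                       batch_first=True), num_layers=2)

    def forward(self, ref_patches, tgt_patches):
        B = ref_patches.size(0)
        # squeeze step: encode reference patches with the bottleneck token,
        # keep only the bottleneck token's output as the scene summary
        bn = self.bottleneck.expand(B, -1, -1)
        squeezed = self.encoder(torch.cat([ref_patches, bn], dim=1))[:, -1:]
        # expansion step: decode from the bottleneck token, a few visible
        # target "hint" patches, and mask tokens for the hidden patches
        hints = tgt_patches[:, :self.n_hints]
        masks = self.mask_token.expand(B, self.n_patches - self.n_hints, -1)
        pred = self.decoder(torch.cat([squeezed, hints, masks], dim=1))[:, 1:]
        # reconstruction loss on the hidden target patches only
        return nn.functional.mse_loss(pred[:, self.n_hints:],
                                      tgt_patches[:, self.n_hints:])
```

In this reading, the single bottleneck token is the only channel carrying information from the reference scene to the target prediction, which is what forces the backbone to pack temporally relevant state into a compact representation.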