Token Bottleneck: One Token to Remember Dynamics
July 9, 2025
Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
cs.AI
Abstract
Deriving compact and temporally aware visual representations from dynamic
scenes is essential for successful execution of sequential scene understanding
tasks such as visual tracking and robotic manipulation. In this paper, we
introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised
learning pipeline that squeezes a scene into a bottleneck token and predicts
the subsequent scene using minimal patches as hints. The ToBo pipeline
facilitates the learning of sequential scene representations by conservatively
encoding the reference scene into a compact bottleneck token during the squeeze
step. In the expansion step, we guide the model to capture temporal dynamics by
predicting the target scene using the bottleneck token along with a few target
patches as hints. This design encourages the vision backbone to embed temporal
dependencies, thereby enabling understanding of dynamic transitions across
scenes. Extensive experiments in diverse sequential tasks, including video
label propagation and robot manipulation in simulated environments, demonstrate
the superiority of ToBo over baselines. Moreover, deploying our pre-trained
model on physical robots confirms its robustness and effectiveness in
real-world environments. We further validate the scalability of ToBo across
different model scales.
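The squeeze-and-expand pipeline described above can be sketched as follows. This is a minimal illustration based only on the abstract: the module sizes, the learnable `bottleneck` and `mask_token` parameters, the number of hint patches (`n_hints`), and the use of plain transformer encoders for both steps are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ToBoSketch(nn.Module):
    """Illustrative sketch: squeeze a reference scene into one bottleneck
    token, then predict the target scene from that token plus a few
    visible target patches. Details are assumptions, not the paper's design."""

    def __init__(self, dim=64, n_patches=16, n_hints=2):
        super().__init__()
        self.n_patches, self.n_hints = n_patches, n_hints
        # learnable bottleneck token appended to the reference patches
        self.bottleneck = nn.Parameter(torch.zeros(1, 1, dim))
        # placeholder token standing in for hidden target patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128,
                                       batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128,
                                       batch_first=True), num_layers=2)

    def forward(self, ref_patches, tgt_patches):
        B = ref_patches.size(0)
        # squeeze step: encode reference patches with the bottleneck token,
        # keep only the bottleneck token's output as the scene summary
        bn = self.bottleneck.expand(B, -1, -1)
        squeezed = self.encoder(torch.cat([ref_patches, bn], dim=1))[:, -1:]
        # expansion step: decode from the bottleneck token, a few visible
        # target "hint" patches, and mask tokens for the hidden patches
        hints = tgt_patches[:, :self.n_hints]
        masks = self.mask_token.expand(B, self.n_patches - self.n_hints, -1)
        pred = self.decoder(torch.cat([squeezed, hints, masks], dim=1))[:, 1:]
        # reconstruction loss on the hidden target patches only
        return nn.functional.mse_loss(pred[:, self.n_hints:],
                                      tgt_patches[:, self.n_hints:])
```

In this reading, the single bottleneck token is the only channel carrying information from the reference scene to the target prediction, which is what forces the backbone to pack temporally relevant state into a compact representation.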