Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
March 14, 2026
Authors: Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo
cs.AI
Abstract
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations. Project page available at: https://seokminlee-chris.github.io/CroBo-ProjectPage.
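The global-to-local objective described above can be sketched in a few lines. The following is a minimal NumPy toy, not the paper's architecture: the dimensions, the mean-pooled linear encoder, and the single-step linear decoder are all illustrative placeholders standing in for CroBo's learned networks, but the data flow mirrors the abstract — compress a reference observation into one bottleneck token, heavily mask a local target crop, and predict the masked patches from sparse visible cues plus the global bottleneck context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not from the paper)
n_patches, patch_dim, token_dim = 64, 48, 32
mask_ratio = 0.9  # "heavily masked" target crop

# --- Global path: compress the reference observation into one bottleneck token
ref_patches = rng.normal(size=(n_patches, patch_dim))   # patchified reference frame
W_enc = rng.normal(size=(patch_dim, token_dim)) / np.sqrt(patch_dim)
bottleneck = np.tanh(ref_patches @ W_enc).mean(axis=0)  # single (token_dim,) token

# --- Local path: heavily mask a target crop, keep only sparse visible cues
crop_patches = rng.normal(size=(n_patches, patch_dim))  # patchified local crop
n_masked = int(mask_ratio * n_patches)
masked_idx = rng.choice(n_patches, size=n_masked, replace=False)
visible_idx = np.setdiff1d(np.arange(n_patches), masked_idx)

# --- Decoder: predict masked patches from visible cues + bottleneck context
visible_summary = np.tanh(crop_patches[visible_idx] @ W_enc).mean(axis=0)
context = np.concatenate([bottleneck, visible_summary])  # (2 * token_dim,)
W_dec = rng.normal(size=(2 * token_dim, patch_dim)) / np.sqrt(2 * token_dim)
pred = np.tile(context @ W_dec, (n_masked, 1))           # one shared toy prediction

# --- Pixel-level reconstruction loss: minimizing it would pressure the
# bottleneck token to encode what-is-where in the reference observation
loss = np.mean((pred - crop_patches[masked_idx]) ** 2)
print(f"masked patches: {n_masked}/{n_patches}, loss: {loss:.4f}")
```

In the actual framework these placeholder linear maps would be trained encoder/decoder networks; the point of the sketch is only that the reconstruction target is local and pixel-level while the conditioning token is global, which is what forces the token to retain scene-wide identities and positions.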