Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Abstract

For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods transfer well across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that an effective visual state must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, so that subtle dynamics across observations can be detected reliably. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations. The project page is available at https://seokminlee-chris.github.io/CroBo-ProjectPage.
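
The global-to-local setup described in the abstract can be illustrated with a small sketch of how one training example might be assembled. This is not the authors' implementation: the function names, the frame size (224), crop size (96), patch size (16), and mask ratio (0.9) are all hypothetical values chosen for illustration, and the sketch assumes the local target crop is cut from the same frame that serves as the reference observation. It only builds index sets; the encoder that would compress the reference into one bottleneck token and the decoder that would reconstruct the masked patches are omitted.

```python
import random


def patch_grid(height, width, patch):
    """Return (row, col) top-left corners of non-overlapping patches."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]


def sample_local_crop(height, width, crop, rng):
    """Pick a random square crop window (top, left, size) inside the frame."""
    top = rng.randrange(0, height - crop + 1)
    left = rng.randrange(0, width - crop + 1)
    return top, left, crop


def split_visible_masked(num_patches, mask_ratio, rng):
    """Randomly split patch indices into visible cues and masked targets."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def build_training_example(height=224, width=224, crop=96, patch=16,
                           mask_ratio=0.9, seed=0):
    """Assemble index sets for one global-to-local example (all sizes assumed)."""
    rng = random.Random(seed)
    top, left, size = sample_local_crop(height, width, crop, rng)
    crop_patches = patch_grid(size, size, patch)
    visible, masked = split_visible_masked(len(crop_patches), mask_ratio, rng)
    return {
        # Patches of the full reference frame: an encoder (not shown) would
        # compress these into a single bottleneck token.
        "reference_patches": patch_grid(height, width, patch),
        # Where the local target crop sits inside the reference frame.
        "crop_window": (top, left, size),
        # Sparse cues a decoder (not shown) would see alongside the token.
        "visible_idx": visible,
        # Patches whose pixels would be the reconstruction targets.
        "masked_idx": masked,
    }


example = build_training_example()
print(len(example["reference_patches"]),
      len(example["visible_idx"]),
      len(example["masked_idx"]))
```

With these assumed sizes, the reference frame yields 196 patches and the crop yields 36, of which 32 are masked and only 4 remain visible. Because so few cues are visible, a decoder would have to rely on the bottleneck token to know what lies where in the crop, which is the pressure the abstract attributes to the objective.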
