Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Abstract

For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations. Project page available at: https://seokminlee-chris.github.io/CroBo-ProjectPage.
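The input side of the global-to-local objective described above can be illustrated with a minimal sketch. It only shows how a local target crop might be sampled from a frame and how its patches might be split into a sparse visible set and a heavily masked set. The function names, the 224-pixel frame, the 96-pixel crop, the 6x6 patch grid, and the 0.9 mask ratio are all hypothetical choices for illustration and are not taken from the paper. The encoder that produces the bottleneck token and the decoder that reconstructs the masked patches are not shown.

```python
import random


def sample_local_crop(frame_h, frame_w, crop_size, seed=0):
    """Pick a square local crop inside a frame.

    Returns (top, left, bottom, right) in pixel coordinates.
    The uniform sampling scheme is an assumption for illustration.
    """
    if crop_size > min(frame_h, frame_w):
        raise ValueError("crop_size must fit inside the frame")
    rng = random.Random(seed)
    top = rng.randint(0, frame_h - crop_size)    # randint is inclusive on both ends
    left = rng.randint(0, frame_w - crop_size)
    return top, left, top + crop_size, left + crop_size


def split_patches(grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Split the patch indices of the target crop into visible and masked sets.

    A high mask_ratio leaves only sparse visible cues, so a decoder would
    have to rely on the global bottleneck token to fill in the rest.
    The default of 0.9 is a hypothetical value, not the paper's setting.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Keep at least one visible patch and at least one masked patch.
    num_masked = min(max(num_masked, 1), num_patches - 1)
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical sizes: a 224x224 reference frame and a 96x96 target crop
# divided into a 6x6 grid of 16-pixel patches.
crop_box = sample_local_crop(224, 224, 96, seed=1)
visible_idx, masked_idx = split_patches(6, 6, mask_ratio=0.9, seed=1)
```

In a full training step under these assumptions, the visible patches of the crop and the bottleneck token of the whole reference frame would be passed to a decoder, and the reconstruction loss would be computed on the masked patches only.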
