In-Context World Modeling for Robot Control

Abstract

Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or different robot morphologies, because they are typically conditioned only on the current observation and a language instruction. By not treating the underlying system configuration as a variable, these models implicitly assume the fixed execution context encountered during training, and therefore require data-intensive fine-tuning for each new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables a robot policy to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional in-context learning, which uses demonstrations to specify what task to perform, ICWM uses the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
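
The idea of gathering task-agnostic interactions before task execution can be illustrated with a minimal Python sketch. It is not the ICWM implementation: every name here (`Transition`, `ToySystem`, `collect_probe_context`, `build_policy_input`) is hypothetical, the random probing strategy is an assumption, and the one-dimensional system with a hidden `gain` merely stands in for an unknown configuration such as a changed camera viewpoint.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Transition:
    """One self-generated interaction: what was seen, what was done, what followed."""
    observation: Sequence[float]
    action: Sequence[float]
    next_observation: Sequence[float]


class ToySystem:
    """Hypothetical 1-D system whose hidden `gain` plays the role of an unknown setup."""

    def __init__(self, gain: float) -> None:
        self.gain = gain
        self.x = 0.0

    def observe(self) -> List[float]:
        return [self.x]

    def step(self, action: Sequence[float]) -> List[float]:
        # The hidden gain determines how an action changes the observation.
        self.x += self.gain * action[0]
        return [self.x]


def collect_probe_context(
    step: Callable[[Sequence[float]], List[float]],
    observe: Callable[[], List[float]],
    action_dim: int,
    num_steps: int,
    seed: int = 0,
) -> List[Transition]:
    """Run random, task-agnostic actions and record the resulting transitions.

    Random probing is an assumption made for this sketch only.
    """
    rng = random.Random(seed)
    context: List[Transition] = []
    obs = observe()
    for _ in range(num_steps):
        action = [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
        next_obs = step(action)
        context.append(Transition(obs, action, next_obs))
        obs = next_obs
    return context


def build_policy_input(
    context: Sequence[Transition],
    observation: Sequence[float],
    instruction: str,
) -> Dict[str, object]:
    """Bundle the interaction history with the usual VLA inputs (hypothetical layout)."""
    return {
        "context": list(context),
        "observation": observation,
        "instruction": instruction,
    }
```

In this toy setting the hidden gain can be recovered from any recorded transition as the observation change divided by the action. That shows the interaction history carries the information a policy would need, whereas the abstract describes the model capturing such dynamics implicitly from the context rather than through an explicit estimate.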