In-Context World Modeling for Robot Control

Abstract

Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. Because they do not treat the underlying system configuration as a variable, these models implicitly assume the fixed execution context encountered during training, so every new environment requires data-intensive fine-tuning. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional in-context learning, which uses demonstrations to specify what task to perform, ICWM uses the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system and can adapt to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
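
The idea of identifying the system from task-agnostic interactions, without parameter updates, can be illustrated with a toy sketch. This is not the ICWM architecture: the 2D rotation standing in for a camera viewpoint, the explicit least-squares estimate standing in for what a model would infer implicitly from its context window, and all function names (`camera_matrix`, `collect_context`, `infer_system`, `act`) are assumptions made purely for illustration.

```python
import numpy as np


def camera_matrix(yaw_rad):
    """2D rotation standing in for an unknown camera viewpoint (toy assumption)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.array([[c, -s], [s, c]])


def collect_context(system, num_steps, rng):
    """Run random, task-agnostic actions and record (action, observed motion) pairs."""
    actions = rng.uniform(-1.0, 1.0, size=(num_steps, 2))
    observed = actions @ system.T  # row i is system @ actions[i]
    return actions, observed


def infer_system(actions, observed):
    """Estimate the action-to-observation map from the context by least squares."""
    solution, *_ = np.linalg.lstsq(actions, observed, rcond=None)
    return solution.T


def act(goal_in_image, system_estimate):
    """Choose the action whose predicted observed motion matches the goal."""
    return np.linalg.solve(system_estimate, goal_in_image)


rng = np.random.default_rng(0)
true_system = camera_matrix(np.deg2rad(40.0))  # viewpoint not seen "in training"

# Interaction phase: gather a short context before the task starts.
actions, observed = collect_context(true_system, num_steps=8, rng=rng)
estimate = infer_system(actions, observed)

# Task phase: move toward a goal expressed in image coordinates.
goal = np.array([0.3, -0.2])
action = act(goal, estimate)
achieved = true_system @ action

# Baseline that assumes the training viewpoint (identity map) and ignores context.
naive = true_system @ goal
```

In this sketch the context-conditioned controller reaches the goal under the rotated viewpoint, while the baseline that assumes the training-time configuration does not. No stored weights change between viewpoints; only the interaction history does.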