In-Context World Modeling for Robot Control

Abstract

Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume the fixed execution context encountered during training, so every new environment requires data-intensive fine-tuning. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning, which uses demonstrations to specify what task to perform, ICWM uses the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
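
The data flow described above (probe the system with task-agnostic actions, place the resulting interactions in the context, then act without changing any parameters) can be illustrated with a deliberately small toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "system variable" is a single unknown 2D camera rotation, the task is reduced to a desired image-space motion instead of a language instruction, and the hypothetical helpers `observe`, `collect_context`, `infer_camera_angle`, and `policy` replace the learned VLA model with an explicit closed-form estimate. ICWM itself is described as capturing the dynamics implicitly inside the model, so this only mirrors the interface, not the method.

```python
import math
import random


def observe(action, camera_angle):
    """Hypothetical environment: the camera reports the commanded 2D motion
    rotated by an angle that the policy is never told."""
    ax, ay = action
    c, s = math.cos(camera_angle), math.sin(camera_angle)
    return (c * ax - s * ay, s * ax + c * ay)


def collect_context(camera_angle, num_steps=8, seed=0):
    """Self-generated, task-agnostic interactions: random probing actions
    paired with the motion they produce in the image."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_steps):
        action = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        context.append((action, observe(action, camera_angle)))
    return context


def infer_camera_angle(context):
    """Explicit stand-in for implicit in-context inference: the rotation that
    best maps actions to observations (closed-form 2D alignment)."""
    cross = sum(ax * oy - ay * ox for (ax, ay), (ox, oy) in context)
    dot = sum(ax * ox + ay * oy for (ax, ay), (ox, oy) in context)
    return math.atan2(cross, dot)


def policy(context, desired_image_motion):
    """Fixed policy (no parameter updates): conditions on the context to pick
    the action that yields the desired motion as seen by the camera."""
    theta = infer_camera_angle(context)
    dx, dy = desired_image_motion
    c, s = math.cos(-theta), math.sin(-theta)
    return (c * dx - s * dy, s * dx + c * dy)


if __name__ == "__main__":
    for angle in (0.7, -2.0):  # two "novel viewpoints" for the same policy
        context = collect_context(angle)
        action = policy(context, (1.0, 0.0))
        print(angle, observe(action, angle))
```

The same `policy` function handles both camera angles because the only thing that changes between them is the context it is given, which is the property the abstract attributes to in-context adaptation.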