Executable World Representation

Abstract

Inspired by the emergent behaviors of large language models, which exhibit generalized, human-like intelligence, the research community is pursuing similar emergent capabilities in world models, with an emphasis on modeling the physical world. Within physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities whose states vary according to their intrinsic properties. Current methods approach object action states through either video generation or dynamic scene reconstruction, but none explicitly models this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture that models the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models, hence the name WorldString. In addition, its fully differentiable structure allows for seamless future integration with policy learning and neural dynamics.