Actionable World Representation

Abstract

Inspired by the emergent behaviors in large language models that generalize human intelligence, the research community is pursuing similar emergent capabilities in world models, with a particular emphasis on modeling the physical world. Within physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities whose varying states are determined by their intrinsic properties. Current methods approach object action states through either video generation or dynamic scene reconstruction, but none explicitly models this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture that models the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; hence the name WorldString. Notably, its fully differentiable structure allows seamless future integration with policy learning and neural dynamics.