Actionable World Representation

Abstract

Inspired by the emergent behaviors of large language models that generalize human intelligence, the research community is pursuing similar emergent capabilities in world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states via either video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Because it serves as a versatile digital twin and acts as a foundational building block for physical world models, we name it WorldString. Notably, its fully differentiable structure enables seamless future integration with policy learning and neural dynamics.