Actionable World Representation

Abstract

Inspired by the emergent behaviors of large language models that generalize human intelligence, the research community is pursuing similar emergent capabilities in world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities whose varying states are determined by their intrinsic properties. Current methods approach object action states through either video generation or dynamic scene reconstruction, but none explicitly models this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture that models the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Because it serves as a versatile digital twin and acts as a foundational building block for physical world models, we name it WorldString. Notably, its fully differentiable structure enables seamless future integration with policy learning and neural dynamics.