로봇 조작을 위한 세계 가치 모델

초록

일반화 가치 모델은 대규모의 혼합 품질 데이터로부터 로봇 정책 학습을 확장하는 데 핵심적인 역할을 한다. 수학적으로 정확한 가치 추정은 깊은 시간적 이해를 요구하며, 모델은 과거 맥락을 이용해 현재 신념을 grounding하는 동시에 미래 결과에 대한 계획을 수립할 수 있어야 한다. 그러나 기존의 대부분 로봇 가치 모델은 주로 정적이거나 시간적으로 희소한 시각 관측 데이터로 사전 학습된 비전-언어 모델(VLM) 백본 위에 구축되어 있어, 가치 추정에 필요한 시간적 모델링 능력이 부족하다. VLM과 달리 세계 모델은 시간적 모델링과 미래 계획에 자연스럽게 탁월하여 일반화 가능한 가치 함수를 학습하기 위한 이상적인 기반이 된다. 이러한 통찰에 기반하여, 우리는 세계 모델과 가치 추정을 결합하여 새로운 일반화 로봇 가치 모델인 World Value Model(WVM)을 구축했으며, 이는 데이터 품질을 평가하기 위한 정확한 작업 진행 상황을 제공한다. 표준 벤치마크에서 WVM은 최고 수준(SOTA)의 가치 순서 상관관계(VOC) 결과를 달성한다. 전문가 데이터만 포함하는 표준 평가 스위트를 보완하기 위해, 우리는 800개의 차선 궤적과 고충실도 인간 주석 프레임으로 구성된 다중 체현 벤치마크인 Suboptimal-Value-Bench를 추가로 도입한다. 평가 결과, WVM은 Suboptimal-Value-Bench에서도 SOTA 성능을 유지하여 전문가 데이터와 차선 데이터 모두를 처리하는 강건성을 입증한다. 정책 학습에 배치되었을 때, WVM은 시뮬레이션 및 실제 배치 환경 모두에서 다양한 정책 추출 접근법의 조작 성능을 향상시키며, 혼합 품질 데이터로부터 학습을 위한 강력한 지침을 제공한다.

English

Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding, requiring models to both ground the current belief using historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, lacking the requisite temporal modeling capabilities for value estimation. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), that offers accurate task progressions to assess data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. Complementing standard evaluation suites that contains only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.