WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation
October 8, 2025
Authors: Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, Shanghang Zhang
cs.AI
Abstract
Wrist-view observations are crucial for VLA models as they capture
fine-grained hand-object interactions that directly enhance manipulation
performance. Yet large-scale datasets rarely include such recordings, resulting
in a substantial gap between abundant anchor views and scarce wrist views.
Existing world models cannot bridge this gap, as they require a wrist-view
first frame and thus fail to generate wrist-view videos from anchor views
alone. Recent visual geometry models such as VGGT, however, provide
geometric and cross-view priors that make it possible to address extreme
viewpoint shifts. Building on these priors, we propose WristWorld, the first
4D world model that generates wrist-view videos solely from anchor views.
WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and
incorporates our Spatial Projection Consistency (SPC) Loss to estimate
geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation,
which employs our video generation model to synthesize temporally coherent
wrist-view videos from the reconstructed perspective. Experiments on Droid,
Calvin, and Franka Panda demonstrate state-of-the-art video generation with
superior spatial consistency, while also improving VLA performance, raising the
average task completion length on Calvin by 3.81% and closing 42.4% of the
anchor-wrist view gap.
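
The reconstruction stage described above relies on viewing a reconstructed point cloud from an estimated wrist-camera pose. As a rough illustration of that geometric step only, the sketch below projects 3D points into a pinhole camera with NumPy. It is not the paper's SPC loss or its VGGT extension; the function name `project_points`, the world-to-camera convention, and the intrinsics and points used in the example are all assumptions made for illustration.

```python
import numpy as np

def project_points(points_world, R_cw, t_cw, K):
    """Project N x 3 world points into a pinhole camera (hypothetical helper).

    Assumed convention: X_cam = R_cw @ X_world + t_cw.
    Returns (uv, valid): N x 2 pixel coordinates and a boolean mask that is
    True for points lying in front of the camera.
    """
    pts_cam = points_world @ R_cw.T + t_cw      # world -> camera frame
    z = pts_cam[:, 2]
    valid = z > 1e-6                            # keep points with positive depth
    z_safe = np.where(valid, z, 1.0)            # avoid division by zero for masked points
    uv_h = pts_cam @ K.T                        # apply intrinsics (homogeneous pixels)
    uv = uv_h[:, :2] / z_safe[:, None]          # perspective divide
    return uv, valid

# Illustrative values only: identity pose and made-up intrinsics.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R_cw = np.eye(3)
t_cw = np.zeros(3)
points = np.array([[0.0, 0.0, 1.0],
                   [0.1, -0.2, 2.0],
                   [0.0, 0.0, -1.0]])   # last point is behind the camera

uv, valid = project_points(points, R_cw, t_cw, K)
print(uv[valid])
print(valid)
```

Pixel coordinates for masked-out points are not meaningful and should be discarded using `valid`. A consistency objective could in principle compare such projections against 2D observations in the target view, but the exact formulation used by WristWorld is not given in this abstract.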