LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

April 13, 2026
Authors: Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
cs.AI

Abstract

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models; (ii) a latent visual space is fundamentally better aligned with the physical action space than a pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction is a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
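The abstract does not specify LARY's evaluation protocol, but one common way to test how well a frozen visual representation aligns with physical control is to probe it: derive a latent action from a pair of observations and regress the ground-truth robot action. The sketch below is purely illustrative; the LatentActionProbe class, the 768-d embedding size, and the 7-DoF action target are assumptions for this example, not the paper's method.

```python
# Illustrative sketch only: encoder choice, probe design, and data shapes
# are assumptions, not LARY's actual evaluation pipeline.
import torch
import torch.nn as nn

class LatentActionProbe(nn.Module):
    """Linear probe from a latent action (difference of frame embeddings
    produced by a frozen visual encoder) to a low-level control target,
    e.g. a 7-DoF end-effector delta."""
    def __init__(self, embed_dim: int, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(embed_dim, action_dim)

    def forward(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        # Latent action = change in the visual representation between frames.
        return self.head(z_next - z_t)

# Hypothetical usage with precomputed 768-d embeddings of two frames.
embed_dim, batch = 768, 32
probe = LatentActionProbe(embed_dim)
z_t = torch.randn(batch, embed_dim)        # embedding of observation o_t
z_next = torch.randn(batch, embed_dim)     # embedding of observation o_{t+1}
target = torch.randn(batch, 7)             # ground-truth action from a trajectory

pred = probe(z_t, z_next)
loss = nn.functional.mse_loss(pred, target)  # lower loss = better vision-to-action alignment
```

Under this kind of protocol, the abstract's two findings translate to: embeddings from general visual foundation models yield lower probe error than those from specialized latent action models, and probing latent embeddings beats probing raw or reconstructed pixels.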