

LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

April 13, 2026
Authors: Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
cs.AI

Abstract

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models; (ii) a latent-based visual space is fundamentally better aligned with physical action space than a pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
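The claim that a latent (semantic) visual space aligns better with physical action space than raw pixels can be illustrated with a linear-probe comparison. The sketch below is not LARY's actual protocol (the abstract does not specify one); it uses synthetic data in which a "semantic" embedding is linear in the underlying action while "pixels" depend on it nonlinearly, then fits a ridge probe from each space back to the action and compares test R². All names and dimensions here are hypothetical.

```python
# Illustrative sketch (synthetic data, hypothetical setup): probe whether a
# feature space or a pixel space is more linearly decodable into actions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 7-DoF actions, a "semantic" embedding that is (by
# construction) linear in the action, and "pixels" that depend nonlinearly.
n, action_dim, feat_dim, pix_dim = 2000, 7, 256, 1024
actions = rng.normal(size=(n, action_dim))
feats = actions @ rng.normal(size=(action_dim, feat_dim)) \
        + 0.1 * rng.normal(size=(n, feat_dim))
pixels = np.tanh(actions @ rng.normal(size=(action_dim, pix_dim))) ** 3 \
         + 0.1 * rng.normal(size=(n, pix_dim))

def probe_r2(X, y, split=1500, lam=1e-2):
    """Fit a ridge linear probe on a train split; return test R^2."""
    Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]
    W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
    resid = yte - Xte @ W
    return 1.0 - resid.var() / yte.var()

print("feature-space probe R^2:", probe_r2(feats, actions))
print("pixel-space probe R^2:  ", probe_r2(pixels, actions))
```

In this toy setup the feature-space probe recovers the actions almost perfectly while the pixel-space probe lags, mirroring the paper's qualitative finding that semantic-level abstraction is a more effective pathway from vision to action than pixel-level signals.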