
Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos

December 1, 2025
Authors: Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
cs.AI

Abstract

Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% over existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and correlates more strongly with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advancing research in video generation.
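The abstract does not specify how the distance between a generated video's fused representation and the learned real-action distribution is computed. As a minimal sketch, assuming a simple late fusion (concatenation) of skeletal and appearance features and a Fréchet-style distance between Gaussians fitted to the real and generated feature sets, the scoring could look like the following. The functions `fuse_features` and `action_plausibility_score`, and the toy feature arrays, are hypothetical illustrations, not the authors' actual models or metric.

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard small imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))


def fuse_features(skeleton_feats, appearance_feats):
    """Concatenate appearance-agnostic skeletal features with appearance features.

    Simple late fusion; the paper's actual fusion scheme is not specified here.
    """
    return np.concatenate([skeleton_feats, appearance_feats], axis=-1)


def action_plausibility_score(real_fused, generated_fused):
    """Distance of generated-video features to the real-action distribution.

    Lower values mean the generated motion lies closer to real human motion.
    """
    mu_r, sigma_r = real_fused.mean(axis=0), np.cov(real_fused, rowvar=False)
    mu_g, sigma_g = generated_fused.mean(axis=0), np.cov(generated_fused, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)


# Toy data standing in for per-video fused features (rows = videos).
rng = np.random.default_rng(0)
real = fuse_features(rng.normal(size=(512, 64)), rng.normal(size=(512, 128)))
gen = fuse_features(rng.normal(0.3, 1.1, size=(256, 64)), rng.normal(size=(256, 128)))
print(action_plausibility_score(real, gen))
```

Any distribution-level distance (e.g., Mahalanobis distance of a single video's features to the real-feature Gaussian) could be substituted; the key idea from the abstract is that plausibility is scored relative to a distribution learned from real human actions, not from reference videos paired one-to-one.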