SemanticMoments: Training-Free Motion Similarity via Third Moment Features
February 9, 2026
Authors: Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady
cs.AI
Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
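To make the core idea concrete, the sketch below illustrates one plausible reading of "temporal statistics (higher-order moments) over features from pre-trained semantic models": summarize a video by the temporal mean, variance, and third central moment of its per-frame semantic features, then compare videos by cosine similarity of those descriptors. The choice of feature extractor, the specific moments, and the normalization are illustrative assumptions, not details given in the abstract.

```python
# Minimal sketch of the idea behind SemanticMoments (not the authors' code):
# summarize a video by temporal moments of per-frame semantic features,
# then compare videos via cosine similarity of those moment descriptors.
# Assumptions: the pretrained model supplying frame features, the set of
# moments, and the normalization are all illustrative choices here.
import numpy as np


def moment_descriptor(frame_features: np.ndarray) -> np.ndarray:
    """frame_features: array of shape (T, D), one semantic feature per frame."""
    mean = frame_features.mean(axis=0)            # 1st moment (appearance-heavy)
    centered = frame_features - mean
    var = (centered ** 2).mean(axis=0)            # 2nd central moment
    third = (centered ** 3).mean(axis=0)          # 3rd central moment
    desc = np.concatenate([mean, var, third])
    return desc / (np.linalg.norm(desc) + 1e-8)   # unit-normalize the descriptor


def motion_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Cosine similarity between the moment descriptors of two videos."""
    return float(moment_descriptor(feats_a) @ moment_descriptor(feats_b))


# Toy usage with random stand-ins for per-frame features from a pretrained model.
rng = np.random.default_rng(0)
video_a = rng.normal(size=(32, 768))   # 32 frames, 768-dim features
video_b = rng.normal(size=(48, 768))
print(motion_similarity(video_a, video_b))
```

Because the descriptor is computed purely from feature statistics along the time axis, it requires no training; retrieval reduces to nearest-neighbor search over these vectors.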