SemanticMoments: Training-Free Motion Similarity via Third Moment Features
February 9, 2026
Authors: Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady
cs.AI
Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
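To make the idea concrete, here is a minimal sketch of the kind of descriptor the abstract describes: per-dimension temporal statistics (mean, standard deviation, and skewness as a third standardized moment) computed over a sequence of per-frame semantic features, compared with cosine similarity for retrieval. The specific choice of moments, the feature dimensionality, and the similarity measure are illustrative assumptions, not the paper's exact pipeline; in practice each frame would be embedded with a pre-trained semantic encoder rather than random features.

```python
import numpy as np

def temporal_moments(frame_features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Summarize a (T, D) sequence of per-frame semantic features with
    per-dimension temporal statistics. The exact set of moments used by
    the paper is an assumption here."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    centered = frame_features - mean
    # Third standardized moment (skewness) per feature dimension.
    skew = (centered ** 3).mean(axis=0) / (std ** 3 + eps)
    return np.concatenate([mean, std, skew])  # (3 * D,) video descriptor

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features; in practice each row would be the embedding of
    # one frame from a pre-trained semantic image encoder (e.g. CLIP or DINO).
    query = rng.normal(size=(32, 512))                  # T=32 frames, D=512 dims
    gallery = [rng.normal(size=(32, 512)) for _ in range(5)]

    q = temporal_moments(query)
    scores = [cosine_similarity(q, temporal_moments(g)) for g in gallery]
    print("retrieval ranking (best first):", np.argsort(scores)[::-1])
```

Because the statistics are pooled over time within a semantic feature space, the descriptor is insensitive to frame order permutations within the pooling window but captures how strongly and asymmetrically each semantic dimension varies, which is the sense in which it emphasizes motion dynamics over static appearance.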