Video Action Differencing
March 10, 2025
Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
cs.AI
Abstract
How do two individuals differ when performing the same action? In this work,
we introduce Video Action Differencing (VidDiff), the novel task of identifying
subtle differences between videos of the same action, which has many
applications, such as coaching and skill learning. To enable development on
this new task, we first create VidDiffBench, a benchmark dataset containing 549
video pairs, with human annotations of 4,469 fine-grained action differences
and 2,075 localization timestamps indicating where these differences occur. Our
experiments demonstrate that VidDiffBench poses a significant challenge for
state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL.
By analyzing failure cases of LMMs on VidDiffBench, we highlight two key
challenges for this task: localizing relevant sub-actions over two videos and
fine-grained frame comparison. To overcome these, we propose the VidDiff
method, an agentic workflow that breaks the task into three stages: action
difference proposal, keyframe localization, and frame differencing, each stage
utilizing specialized foundation models. To encourage future research in this
new task, we release the benchmark at
https://huggingface.co/datasets/jmhb/VidDiffBench and code at
http://jmhb0.github.io/viddiff.
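The abstract describes the VidDiff method as a three-stage agentic workflow: action difference proposal, keyframe localization, and frame differencing. The sketch below illustrates how such a pipeline could be organized; it is an assumption-laden skeleton, not the authors' released implementation (see http://jmhb0.github.io/viddiff for the actual code), and the helper names (`propose_differences`, `localize_keyframes`, `compare_frames`) are hypothetical placeholders for calls to specialized foundation models.

```python
# Minimal sketch of a three-stage video action differencing workflow.
# The model-backed helpers are placeholders (assumptions), not the authors' code.
from dataclasses import dataclass
from typing import List

@dataclass
class DifferenceProposal:
    description: str  # e.g. "deeper squat at the bottom of the movement"
    sub_action: str   # the sub-action the difference refers to

def propose_differences(action_name: str) -> List[DifferenceProposal]:
    """Stage 1: an LLM proposes candidate fine-grained differences for the action."""
    # Placeholder: a real system would query a language model here.
    return [DifferenceProposal("deeper squat at the bottom", "descent")]

def localize_keyframes(video_frames: List, proposal: DifferenceProposal) -> List[int]:
    """Stage 2: retrieve frame indices relevant to the proposal's sub-action."""
    # Placeholder: a real system would use a vision-language retriever here.
    return [0] if video_frames else []

def compare_frames(frames_a: List, frames_b: List, proposal: DifferenceProposal) -> str:
    """Stage 3: a multimodal model judges which video exhibits the difference."""
    # Placeholder: a real system would call a VLM on the localized frame pairs.
    return "A" if len(frames_a) >= len(frames_b) else "B"

def vid_diff(video_a: List, video_b: List, action_name: str) -> dict:
    """Run all three stages and collect a per-difference prediction (A or B)."""
    results = {}
    for proposal in propose_differences(action_name):
        keys_a = localize_keyframes(video_a, proposal)
        keys_b = localize_keyframes(video_b, proposal)
        frames_a = [video_a[i] for i in keys_a]
        frames_b = [video_b[i] for i in keys_b]
        results[proposal.description] = compare_frames(frames_a, frames_b, proposal)
    return results
```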