MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
June 2, 2025
Authors: Yipeng Du, Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Xiang Li, Jian Yang, Zhenheng Yang, Ying Tai
cs.AI
Abstract
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame difference analysis and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to the temporal complexities of video, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether this inherent capability can be unlocked to boost MLLMs' motion perception, producing distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method that pioneers object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into a valuable data asset, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, approximately 40K video clips, and approximately 87K QA pairs. Experiments show that MotionSight achieves state-of-the-art open-source performance and is competitive with commercial models. In particular, for fine-grained motion understanding, we present a novel zero-shot technique and a large-scale, high-quality dataset. All code and annotations will be made publicly available.
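
To make the two visual-prompt ideas named in the abstract more concrete, below is a minimal sketch, not the authors' released implementation: an object-centric "spotlight" that dims everything outside a region of interest, and a motion-blur prompt obtained by averaging a short window of frames. All function names, parameters (e.g., the dimming factor), and the use of NumPy are assumptions made purely for illustration.

```python
# Illustrative sketch only; assumes bounding boxes for objects of interest are
# available (e.g., from an off-the-shelf detector), as the paper's pipeline is
# zero-shot and training-free.
import numpy as np


def spotlight(frame: np.ndarray, box: tuple, dim: float = 0.3) -> np.ndarray:
    """Dim pixels outside `box` = (x1, y1, x2, y2) so the object region stands out."""
    x1, y1, x2, y2 = box
    out = frame.astype(np.float32) * dim            # darken the whole frame
    out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]         # restore full brightness inside the box
    return out.astype(np.uint8)


def motion_blur_prompt(frames: list) -> np.ndarray:
    """Average neighboring frames so object/camera motion leaves a visible trail."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return stack.mean(axis=0).astype(np.uint8)


if __name__ == "__main__":
    # Toy clip: a bright square drifting to the right over three frames.
    frames = []
    for t in range(3):
        f = np.zeros((120, 160, 3), dtype=np.uint8)
        x = 20 + 15 * t
        f[40:70, x:x + 30] = 255
        frames.append(f)

    highlighted = spotlight(frames[1], box=(20, 40, 100, 70))
    blurred = motion_blur_prompt(frames)
    # In a zero-shot pipeline, such prompted frames would be passed to an
    # off-the-shelf MLLM together with the motion-related question.
    print(highlighted.shape, blurred.shape)  # (120, 160, 3) (120, 160, 3)
```

The frame-averaging blur here is only a stand-in for whatever motion-blur rendering the paper actually uses; the point of the sketch is that both prompts are simple image-space operations applied before the MLLM sees the frames, requiring no fine-tuning.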