FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
December 11, 2025
Authors: Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li
cs.AI
Abstract
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets often rely on costly manual annotation, which severely limits scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines such as Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
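To make the two-stage pipeline described above concrete, here is a minimal Python sketch of what such an auto-labeling flow could look like. It is an illustration under my own assumptions, not the paper's implementation: `detect_and_track` and `query_llm` are hypothetical stand-ins for an off-the-shelf detector/tracker and a multimodal LLM API, and the prompt wording is invented; the abstract does not specify the actual components or prompts.

```python
# Hypothetical sketch of FoundationMotion's two stages as described in the
# abstract: (1) detect + track objects to get trajectories, (2) serialize
# trajectories alongside frames and ask an LLM for captions and QA pairs.
from dataclasses import dataclass

@dataclass
class Trajectory:
    object_id: int
    label: str                                       # e.g. "red car"
    boxes: list[tuple[float, float, float, float]]   # (x, y, w, h) per frame

def detect_and_track(video_path: str) -> list[Trajectory]:
    """Hypothetical stage 1: run an object detector plus tracker over the
    video and return one trajectory per tracked object."""
    raise NotImplementedError("plug in a real detector/tracker here")

def centers(traj: Trajectory) -> list[tuple[float, float]]:
    """Reduce each box to its center point so the motion is easy to
    serialize as text for the LLM."""
    return [(x + w / 2, y + h / 2) for x, y, w, h in traj.boxes]

def build_prompt(trajectories: list[Trajectory]) -> str:
    """Serialize trajectories into a text prompt asking the LLM for
    fine-grained motion captions and diverse question-answer pairs."""
    lines = ["Objects and their per-frame center positions:"]
    for t in trajectories:
        pts = ", ".join(f"({cx:.0f},{cy:.0f})" for cx, cy in centers(t))
        lines.append(f"- {t.label} (id {t.object_id}): {pts}")
    lines.append(
        "Describe each object's motion, then write diverse QA pairs "
        "about speed, direction, and relative spatial movement."
    )
    return "\n".join(lines)

def query_llm(prompt: str, frames: list[bytes]) -> str:
    """Hypothetical stage 2: call a multimodal LLM with the serialized
    trajectories and sampled video frames."""
    raise NotImplementedError("plug in a real LLM API here")
```

The key design point the abstract highlights is that no step requires human annotation: the tracker grounds the motion in pixel-level trajectories, and the LLM converts those trajectories into language supervision, which is what makes the dataset construction scale.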