Future Optical Flow Prediction Improves Robot Control & Video Generation

January 15, 2026
Authors: Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue, Ran Xu, Caiming Xiong, Silvio Savarese, Michael S Ryoo, Juan Carlos Niebles
cs.AI

Abstract

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
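
To make the core idea concrete, here is a minimal conceptual sketch of language-conditioned optical flow forecasting with a diffusion-style denoiser: a multimodal embedding of the current frame and the instruction conditions an iterative denoising loop that outputs a dense 2-channel flow map. All module names, shapes, and the simplified sampling loop are hypothetical illustrations, not the FOFPred implementation or API.

```python
# Conceptual sketch only: a toy language-conditioned flow denoiser.
# Names, shapes, and the sampler are placeholders, not FOFPred's actual code.
import torch
import torch.nn as nn

class FlowDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a 2-channel flow map,
    conditioned on a multimodal (image + instruction) embedding."""
    def __init__(self, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 + cond_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, noisy_flow, cond):
        # cond: (B, cond_dim), broadcast spatially and concatenated with the flow.
        b, _, h, w = noisy_flow.shape
        cond_map = cond[:, :, None, None].expand(b, -1, h, w)
        return self.net(torch.cat([noisy_flow, cond_map], dim=1))

@torch.no_grad()
def sample_flow(denoiser, cond, shape=(1, 2, 32, 32), steps=10):
    """Greatly simplified iterative denoising: start from Gaussian noise and
    repeatedly subtract the predicted noise (a stand-in for a real
    DDPM/DDIM schedule)."""
    flow = torch.randn(shape)
    for _ in range(steps):
        flow = flow - denoiser(flow, cond) / steps
    return flow  # (B, 2, H, W) dense future-flow estimate

# Usage: in the paper's setting, a VLM would map (current frame, instruction)
# to the conditioning vector; here we use a random placeholder embedding.
cond = torch.randn(1, 64)
flow = sample_flow(FlowDenoiser(), cond)
print(flow.shape)  # torch.Size([1, 2, 32, 32])
```

The sketch only illustrates the interface (image-plus-language conditioning in, dense flow out); the paper's contribution lies in the unified VLM-Diffusion backbone, the image pretraining, and the preprocessing that makes noisy web video-caption data usable for this prediction task.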