MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
December 7, 2025
Authors: Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li
cs.AI
Abstract
Embodied imitation learning is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video generation models for this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To address these limitations, we introduce MIND-V, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. For long-horizon robustness, MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy. To align the generated videos with physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA world model to enforce physical plausibility by aligning predicted and actual dynamic evolution in feature space. MIND-V achieves state-of-the-art performance in long-horizon robotic manipulation video generation, establishing a scalable and controllable paradigm for embodied data synthesis.
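To make the PFC reward and GRPO post-training described above concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation. It assumes a V-JEPA-style visual encoder and a feature-space predictor are available as callables, scores a generated clip by the cosine agreement between predicted and actually observed future features, and normalizes rewards group-relatively as GRPO does. All names (pfc_reward, grpo_advantages, encoder, predictor) are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def pfc_reward(context_frames, generated_future, encoder, predictor):
        """Hypothetical Physical Foresight Coherence (PFC) reward sketch.

        encoder:   maps frames -> feature sequence (V-JEPA-style encoder), assumed given
        predictor: rolls context features forward to predicted future features, assumed given
        """
        with torch.no_grad():
            ctx_feats = encoder(context_frames)       # (B, T_ctx, D) context features
            pred_feats = predictor(ctx_feats)         # (B, T_fut, D) predicted dynamics
            actual_feats = encoder(generated_future)  # (B, T_fut, D) observed dynamics
        # Reward = mean cosine similarity between predicted and observed evolution,
        # so rollouts whose dynamics match the world model's foresight score higher.
        return F.cosine_similarity(pred_feats, actual_feats, dim=-1).mean(dim=-1)  # (B,)

    def grpo_advantages(rewards, eps=1e-8):
        """Group-relative advantages in the GRPO style: normalize each sampled
        rollout's reward against the mean/std of its group (rewards: (G,))."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Usage sketch: G candidate rollouts sampled for one prompt
    # rewards = pfc_reward(ctx, futures, encoder, predictor)  # (G,)
    # advs = grpo_advantages(rewards)  # zero-mean advantages for the policy update

The design point this illustrates is that the reward is computed entirely in the world model's feature space, not in pixel space, so physically implausible motion is penalized even when individual frames look sharp.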