

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

November 20, 2025
Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
cs.AI

Abstract

While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input and predicts the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and to the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
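The abstract describes Joint-GRPO only at a high level: a single shared reward scores both the VLM's caption and the VDM's video, and a group-relative advantage derived from that reward is used to update the two models together. The snippet below is a minimal, illustrative Python sketch of that coupling under stated assumptions; it is not the released implementation, and the `Rollout` fields, reward weights, and scoring values are hypothetical placeholders.

```python
# Illustrative sketch of a Joint-GRPO-style shared-reward step (not the VANS code).
# All names and scores here are hypothetical; the actual reward design is in the paper.

from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    caption: str          # next-event caption sampled from the VLM
    caption_score: float  # hypothetical accuracy score vs. the ground-truth event
    video_score: float    # hypothetical fidelity score of the VDM video for this caption


def shared_reward(r: Rollout, w_text: float = 0.5, w_video: float = 0.5) -> float:
    """One scalar reward applied to both models, coupling their objectives."""
    return w_text * r.caption_score + w_video * r.video_score


def group_relative_advantages(rollouts: List[Rollout]) -> List[float]:
    """GRPO-style: normalize rewards within a sampled group instead of using a critic."""
    rewards = [shared_reward(r) for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero for identical rewards
    return [(rw - mean) / std for rw in rewards]


if __name__ == "__main__":
    group = [
        Rollout("pull the wide end through the loop", 0.8, 0.9),
        Rollout("tighten the knot upward", 0.6, 0.7),
        Rollout("fold the napkin", 0.1, 0.2),
    ]
    for r, adv in zip(group, group_relative_advantages(group)):
        # Both the VLM (caption policy) and the VDM (video policy) would be
        # updated with this same advantage, so they are optimized as a unit.
        print(f"{adv:+.2f}  {r.caption}")
```

The point of the shared reward is that neither model is optimized in isolation: a caption that is accurate but hard to visualize, or a video that looks good but drifts from the caption and input context, both pull the joint reward down.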