

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

November 20, 2025
Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
cs.AI

Abstract

While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input and predicts the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of videos with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we construct VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
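To make the shared-reward idea concrete, the sketch below shows a minimal, hypothetical joint GRPO-style update in Python/PyTorch: a group of (caption, video) rollouts is sampled, a single shared reward scores each pair, and the group-relative advantage weights the log-probabilities of both the VLM and the VDM. The model stubs (`vlm.sample`, `vdm.sample`), the `reward_fn` interface, and all hyperparameters are illustrative assumptions, not the authors' VANS / Joint-GRPO implementation.

```python
# Hypothetical sketch of a joint GRPO-style update with one shared reward.
# All interfaces below are assumed for illustration, not the VANS codebase.
import torch

GROUP_SIZE = 4   # number of sampled (caption, video) rollouts per prompt
EPS = 1e-6

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within the sampled group (the GRPO advantage)."""
    return (rewards - rewards.mean()) / (rewards.std() + EPS)

def joint_grpo_step(vlm, vdm, reward_fn, video_ctx, question, optimizer):
    """One shared-reward update over the VLM (captioner) and VDM (video generator).

    Assumed interfaces (hypothetical):
      vlm.sample(video_ctx, question) -> (caption, caption_logprob)
      vdm.sample(caption, video_ctx)  -> (video, video_logprob)
      reward_fn(caption, video, video_ctx, question) -> float
    """
    logps_vlm, logps_vdm, rewards = [], [], []
    for _ in range(GROUP_SIZE):
        caption, logp_c = vlm.sample(video_ctx, question)
        video, logp_v = vdm.sample(caption, video_ctx)
        logps_vlm.append(logp_c)
        logps_vdm.append(logp_v)
        # A single scalar reward scores the caption-video pair jointly.
        rewards.append(reward_fn(caption, video, video_ctx, question))

    adv = group_relative_advantages(torch.tensor(rewards))

    # Policy-gradient surrogate: the same advantage weights both models'
    # log-probabilities, pushing the VLM toward captions the VDM can
    # visualize well and the VDM toward videos faithful to those captions.
    loss = -(adv * (torch.stack(logps_vlm) + torch.stack(logps_vdm))).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards
```

The key design point reflected here is that neither model is rewarded in isolation: because one shared reward drives both updates, the caption policy is credited only when its output also leads to a good video, which is what couples the VLM and VDM into a single unit.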