Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Abstract

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction, and we align latent states to future-frame embeddings. We then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on two benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best model, Video-CoE, by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
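The idea of alternating between language tokens and continuous latent spans during autoregressive decoding can be illustrated with a toy loop. This is only a minimal sketch under loud assumptions, not the Future-L1 implementation: the abstract does not specify how latent spans are opened, how long they are, or what the backbone looks like. The `<latent>` marker, the fixed `SPAN_LEN`, the tiny recurrent `step` function, and every other name below are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import random

# All names and sizes below are hypothetical; nothing here comes from the paper.
VOCAB = ["<bos>", "the", "cup", "falls", "<latent>", "<eos>"]
LATENT, EOS = "<latent>", "<eos>"
SPAN_LEN = 3  # assumed fixed number of latent steps per span
DIM = 4       # toy hidden size


def make_toy_model(seed=0):
    """Build random toy weights: token embeddings, state matrix, output matrix."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

    embed = {tok: rand_vec() for tok in VOCAB}
    w_state = [rand_vec() for _ in range(DIM)]
    w_out = [rand_vec() for _ in VOCAB]
    return embed, w_state, w_out


def step(state, x, w_state):
    """Toy recurrent update standing in for one forward pass of the model."""
    return [math.tanh(sum(w * (s + xi) for w, s, xi in zip(row, state, x)))
            for row in w_state]


def greedy(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def decode(pick=greedy, max_steps=12, seed=0):
    """Interleave discrete text steps with continuous latent steps."""
    embed, w_state, w_out = make_toy_model(seed)
    state, x = [0.0] * DIM, embed["<bos>"]
    trace, latent_left = [], 0
    for _ in range(max_steps):
        state = step(state, x, w_state)
        if latent_left > 0:
            # Latent step: no token is chosen; the continuous hidden state
            # itself becomes the next input.
            trace.append(("latent", list(state)))
            x = state
            latent_left -= 1
            continue
        # Text step: project to vocabulary logits and choose a discrete token.
        logits = [sum(w * s for w, s in zip(row, state)) for row in w_out]
        tok = VOCAB[pick(logits)]
        trace.append(("text", tok))
        if tok == EOS:
            break
        if tok == LATENT:
            latent_left = SPAN_LEN  # open a latent span
        x = embed[tok]
    return trace


# Scripted token choices make the interleaving visible and deterministic.
script = iter(["the", LATENT, "falls", EOS])
trace = decode(pick=lambda logits: VOCAB.index(next(script)))
print([value if kind == "text" else kind for kind, value in trace])
# ['the', '<latent>', 'latent', 'latent', 'latent', 'falls', '<eos>']
```

In this sketch the latent steps never pass through the vocabulary, so whatever the hidden state encodes is carried forward without being forced into words, which is the property the abstract argues for. A real system would replace the toy recurrence with the MLLM and would need the training described above to make the latent states meaningful.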