Structured Causal Video Reasoning via Multi-Objective Alignment
April 6, 2026
Authors: Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke
cs.AI
Abstract
Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is buried in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient reasoning and fragile causal inference. To bridge this cognitive gap, we propose constructing, prior to the reasoning stage, a compact representation of salient events and their causal relationships, which we name Structured Event Facts. This structured prior serves as an explicit constraint that promotes concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising fact alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making optimization difficult. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. The result is Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
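The MORL formulation trades off three reward signals: structural completeness, causal fidelity, and reasoning length. As a minimal illustration of "optimizing toward the Pareto frontier" (this is a generic dominance-filter sketch, not the paper's actual training procedure; the objective names and scores are hypothetical), one can select the non-dominated candidate rollouts under a vector-valued reward:

```python
from typing import List, Tuple

Score = Tuple[float, ...]  # one value per objective, larger is better

def dominates(a: Score, b: Score) -> bool:
    """a Pareto-dominates b: at least as good on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores: List[Score]) -> List[Score]:
    """Keep only the non-dominated objective vectors."""
    return [s for s in scores if not any(dominates(o, s) for o in scores if o != s)]

# Hypothetical rollouts scored as (structure, causal fidelity, negated length):
candidates = [
    (0.9, 0.7, -120.0),
    (0.8, 0.8, -80.0),
    (0.6, 0.6, -150.0),  # dominated by both vectors above
]
front = pareto_front(candidates)  # the first two candidates survive
```

Rollouts on the frontier represent different but equally defensible trade-offs between structure, causality, and brevity; a MORL trainer would prefer them over dominated rollouts rather than collapsing the objectives into a single fixed weighting.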