Structured Causal Video Reasoning via Multi-Objective Alignment

Abstract

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient reasoning processes and fragile causal inference. To bridge this cognitive gap, we propose constructing, prior to the reasoning stage, a compact representation of salient events and their causal relationships, which we name Structured Event Facts. This structured prior serves as an explicit constraint that promotes concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning (RL)-based post-training. During the RL stage, we find that this framework introduces competing objectives: structural completeness and causal fidelity must be balanced against reasoning length, which makes optimization difficult. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
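To make the notion of a Pareto frontier over competing objectives concrete, the following is a minimal illustrative sketch, not the authors' training procedure. It assumes each candidate response has already been scored on three hypothetical objectives (structural completeness, causal fidelity, and brevity, all "higher is better"); the score values and function names are invented for illustration. A candidate is on the frontier when no other candidate is at least as good on every objective and strictly better on one.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical scores per candidate response:
# (structural completeness, causal fidelity, brevity)
candidates = [
    (0.9, 0.8, 0.2),  # complete and faithful, but long
    (0.6, 0.7, 0.9),  # shorter, somewhat less complete
    (0.5, 0.6, 0.8),  # worse than the second candidate on every objective
    (0.9, 0.8, 0.1),  # same quality as the first, but longer
]

front = pareto_front(candidates)
print(front)
```

In this toy setting the first two candidates represent different trade-offs between quality and length, and neither dominates the other; the remaining two are each dominated and would fall off the frontier.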
