CogniRoute: Learning to Route Social Evidence in Omni-Modal Models

Abstract

Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, a vocal tone, a temporal cue, or a mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and over the strongest open-source omni baseline by 26.77 percentage points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.