VLMs Are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Abstract

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to complete reasoning tasks by generating temporally coherent visual trajectories. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, which leads to logical failures across diverse reasoning scenarios. Existing work uses Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often fail to faithfully execute fine-grained or long-tail instructions even when given a valid plan. Although VLMs struggle as solvers, they have strong perception capabilities for evaluating process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that moves the VLM from the role of solver to that of "teacher". Specifically, a VLM teacher extracts task-specific rules and formulates them as differentiable rewards, which guide a VGM reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings indicate that integrating VLMs as test-time teachers is a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
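To make the core idea of test-time optimization concrete, the following is a minimal toy sketch and not the method's actual implementation. It assumes a tiny frozen linear map `W` in place of the VGM, a rank-1 adapter (`A`, `B`) in place of the LoRA module, and a negative squared distance to a goal point in place of a VLM-derived differentiable reward. All names (`generate`, `reward`, `adapt_at_test_time`) and all numeric values are hypothetical and chosen only for illustration.

```python
# Toy sketch only: a 2-D linear "generator" stands in for the VGM, and a
# squared-distance reward stands in for a VLM-derived differentiable reward.

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def generate(W, A, B, x):
    """Output of the frozen base W plus the low-rank update B @ A, applied to x."""
    h = matvec(A, x)  # rank-r bottleneck activations
    return [b + d for b, d in zip(matvec(W, x), matvec(B, h))]


def reward(y, goal):
    """Hypothetical differentiable reward: negative squared distance to the goal."""
    return -sum((yi - gi) ** 2 for yi, gi in zip(y, goal))


def adapt_at_test_time(W, x, goal, rank=1, steps=200, lr=0.05):
    """Gradient ascent on the reward with respect to the adapter factors only."""
    d_out, d_in = len(W), len(W[0])
    A = [[0.5] * d_in for _ in range(rank)]   # fixed init, for reproducibility
    B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op
    for _ in range(steps):
        h = matvec(A, x)
        y = generate(W, A, B, x)
        # Gradient of the reward with respect to the output y.
        g_y = [-2.0 * (yi - gi) for yi, gi in zip(y, goal)]
        # Back-propagate through B to the bottleneck (uses B before its update).
        g_h = [sum(B[i][k] * g_y[i] for i in range(d_out)) for k in range(rank)]
        for i in range(d_out):
            for k in range(rank):
                B[i][k] += lr * g_y[i] * h[k]  # ascent step on B
        for k in range(rank):
            for j in range(d_in):
                A[k][j] += lr * g_h[k] * x[j]  # ascent step on A
    return A, B


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights, never updated
x = [1.0, 0.5]                # stand-in for the task input
goal = [2.0, -1.0]            # stand-in for a rule-satisfying outcome
A, B = adapt_at_test_time(W, x, goal)
print("reward before:", reward(matvec(W, x), goal))
print("reward after: ", reward(generate(W, A, B, x), goal))
```

In this sketch only the adapter factors change while the base weights stay frozen, which mirrors the division of labor described above: the reward supplies the gradient signal, and the lightweight module absorbs the per-task adaptation.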