CyberV: Cybernetics for Test-time Scaling in Video Understanding

Abstract

Current Multimodal Large Language Models (MLLMs) can struggle to understand long or complex videos because of computational demands at test time, a lack of robustness, and limited accuracy, all of which stem primarily from their feed-forward processing. These limitations can be more severe for models with fewer parameters. To address them, we propose a novel framework inspired by cybernetic principles that redesigns video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the Sensor monitors the forward processes of the MLLM and collects intermediate interpretations, such as attention drift; the Controller then determines when and how to trigger self-correction and generates feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable to that of human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks such as VideoMME and WorldSense, highlighting its effectiveness and generalization in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
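The loop of inference system, Sensor, and Controller described in the abstract can be illustrated with a minimal Python sketch. This is not the released CyberV code: the names `Observation`, `sensor`, `controller`, `cybernetic_loop`, and `toy_model`, the two scalar signals, the threshold values, and the feedback strings are all hypothetical, chosen only to show how a frozen model might be wrapped in a monitor-and-retry loop without retraining.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Observation:
    """Signals read from one forward pass (hypothetical fields)."""
    answer: str
    confidence: float       # assumed: a score in [0, 1] for the chosen answer
    attention_drift: float  # assumed: larger means attention moved off the video


# The frozen model is a black box: (question, feedback) -> Observation.
Model = Callable[[str, Optional[str]], Observation]


def sensor(obs: Observation) -> Dict[str, float]:
    """Collect intermediate signals; here it only repackages them."""
    return {"confidence": obs.confidence, "drift": obs.attention_drift}


def controller(signals: Dict[str, float],
               conf_min: float = 0.7,
               drift_max: float = 0.3) -> Optional[str]:
    """Return feedback for another round, or None to accept the answer.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if signals["drift"] > drift_max:
        return "re-attend to key frames"
    if signals["confidence"] < conf_min:
        return "reconsider the answer"
    return None


def cybernetic_loop(model: Model, question: str,
                    max_rounds: int = 3) -> Tuple[str, int]:
    """Run inference, monitor it, and retry with feedback when needed."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        obs = model(question, feedback)       # forward pass of the frozen model
        feedback = controller(sensor(obs))    # decide whether to self-correct
        if feedback is None:
            break
    return obs.answer, round_idx


def toy_model(question: str, feedback: Optional[str]) -> Observation:
    """Stand-in for an MLLM: unsure at first, better once given feedback."""
    if feedback is None:
        return Observation(answer="B", confidence=0.55, attention_drift=0.45)
    return Observation(answer="C", confidence=0.90, attention_drift=0.10)


answer, rounds = cybernetic_loop(toy_model, "What is shown in the video?")
print(answer, rounds)
```

With the toy model, the first round is rejected because the drift signal exceeds its threshold, and the second round is accepted. The model weights are never touched; extra compute is spent only when the Controller asks for another round, which is the sense in which the scaling happens at test time.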
