SiLVR: A Simple Language-based Video Reasoning Framework

Abstract

Recent advances in test-time optimization have given Large Language Models (LLMs) remarkable reasoning capabilities, enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still lag significantly behind, especially on complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, the language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity at which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study of video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
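
The abstract does not spell out how the adaptive token reduction scheme chooses its temporal granularity. As an illustration only, the sketch below assumes one plausible reading: keep every caption if the text fits the reasoning LLM's context budget, and otherwise double the sampling stride until it does. The names `count_tokens`, `reduce_to_budget`, and `build_prompt`, the whitespace token count, and the doubling rule are all hypothetical and are not taken from the SiLVR code.

```python
from typing import Callable, List, Tuple

Caption = Tuple[float, str]  # (clip start time in seconds, caption text)


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def reduce_to_budget(
    captions: List[Caption],
    budget: int,
    counter: Callable[[str], int] = count_tokens,
) -> Tuple[List[Caption], int]:
    """Coarsen temporal sampling until the captions fit in `budget` tokens.

    Assumed rule (not from the paper): keep every `stride`-th clip caption
    and double `stride` whenever the total is still over budget.
    Returns the kept captions and the stride that was used.
    """
    stride = 1
    while True:
        kept = captions[::stride]
        total = sum(counter(text) for _, text in kept)
        # Stop once we fit, or when at most one caption remains.
        if total <= budget or len(kept) <= 1:
            return kept, stride
        stride *= 2


def build_prompt(kept: List[Caption], subtitles: str, question: str) -> str:
    """Assemble the language-only input for the second-stage reasoning LLM."""
    lines = [f"[{start:.0f}s] {text}" for start, text in kept]
    return (
        "Clip captions:\n" + "\n".join(lines)
        + "\n\nSpeech subtitles:\n" + subtitles
        + "\n\nQuestion: " + question
    )
```

With sixteen six-word captions and a budget of 30 tokens, this sketch settles on a stride of 4 and keeps four captions. A real system would swap in the target model's tokenizer and would also need to budget for the subtitles and the question.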
