Temporal Reasoning Transfer from Text to Video

Abstract

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle to track temporal changes and to reason about temporal relationships. Previous research attributed this limitation to ineffective temporal encoding of visual inputs, but our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability is the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3-point absolute accuracy improvement on the challenging TempCompass benchmark and enabling our model to outperform ShareGPT4Video-8B, which was trained on 28,000 video samples. The enhanced LongVA-7B model also achieves competitive performance on comprehensive video benchmarks. For example, it reaches 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between performance on textual and video temporal tasks, validating the efficacy of transferring temporal reasoning abilities from the text domain to the video domain.
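To make the idea of synthesizing text-only temporal tasks from image-text data more concrete, the following is a minimal sketch under stated assumptions, not the authors' pipeline. It assumes that each image caption can stand in for one "frame" of a pseudo-video and that an order question can be built by asking which of two captions comes first. The names `TextualTemporalQA` and `make_order_question`, the question wording, and the two-option format are all hypothetical and chosen for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class TextualTemporalQA:
    context: str        # captions laid out as a pseudo-video, one per "frame"
    question: str
    options: list[str]  # lettered answer options
    answer: str         # letter of the correct option


def make_order_question(captions: list[str], rng: random.Random) -> TextualTemporalQA:
    """Build one text-only order question from a list of image captions.

    Hypothetical helper (an assumption, not from the paper): captions are
    treated as consecutive frames, and the question asks which of two
    sampled captions appears earlier in the sequence.
    """
    if len(captions) < 2:
        raise ValueError("need at least two captions")
    # Sample two distinct frame positions; the smaller index is the ground truth.
    first, second = sorted(rng.sample(range(len(captions)), 2))
    context = "\n".join(f"Frame {i + 1}: {c}" for i, c in enumerate(captions))
    # Shuffle the option order so the correct letter is not always "A".
    candidates = [captions[first], captions[second]]
    rng.shuffle(candidates)
    letters = ["A", "B"]
    options = [f"{letter}. {cand}" for letter, cand in zip(letters, candidates)]
    answer = letters[candidates.index(captions[first])]
    question = "Which scene is described earlier in the sequence?"
    return TextualTemporalQA(context, question, options, answer)
```

Calling `make_order_question(captions, random.Random(0))` on a handful of unrelated captions yields a question whose answer depends only on the order in which the captions were laid out, which is the kind of purely textual temporal signal the abstract describes.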
