Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Abstract

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, where and how they extract and propagate video and textual information internally remains underexplored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs begins with active cross-frame interactions in early-to-middle layers; (2) progressive video-language integration follows in middle layers, facilitated by alignment between video representations and linguistic embeddings that contain temporal concepts; (3) once this integration is complete, the model is ready to generate correct answers in middle-to-late layers; and (4) based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial share of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint of how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io.
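To make the idea of suppressing attention edges concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a single attention head, random toy inputs, and a hypothetical knockout pattern in which text (question) tokens are blocked from reading video tokens. The function name `attention_with_knockout` and all shapes are illustrative assumptions.

```python
import numpy as np

def attention_with_knockout(q, k, v, blocked=None):
    """Single-head causal attention; edges marked True in `blocked` are masked out.

    q, k, v: arrays of shape (n, d). blocked: optional bool array of shape (n, n),
    where blocked[i, j] = True removes the edge from query i to key j.
    Illustrative helper only, not taken from the paper's code.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position i may attend only to positions j <= i.
    allowed = np.tril(np.ones((n, n), dtype=bool))
    if blocked is not None:
        allowed &= ~blocked
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax; the diagonal is always allowed here,
    # so every row has at least one finite score.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_video, n_text, d = 4, 3, 8          # toy sizes, assumed for illustration
n = n_video + n_text
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# Hypothetical knockout: text tokens (rows) may not read video tokens (columns).
blocked = np.zeros((n, n), dtype=bool)
blocked[n_video:, :n_video] = True

out_full, w_full = attention_with_knockout(q, k, v)
out_cut, w_cut = attention_with_knockout(q, k, v, blocked)
```

In a study of this kind, one would compare the model's answer probability with and without such a mask at chosen layers; a large drop suggests that the blocked edges carry information the answer depends on. Here the outputs at video positions are unchanged, because only the rows belonging to text tokens are masked.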
