Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
October 15, 2025
Authors: Minji Kim, Taekyung Kim, Bohyung Han
cs.AI
Abstract
Video Large Language Models (VideoLLMs) extend the capabilities of
vision-language models to spatiotemporal inputs, enabling tasks such as video
question answering (VideoQA). Despite recent advances in VideoLLMs, where and
how they extract and propagate video and textual information internally
remains underexplored. In this study, we investigate the
internal information flow of VideoLLMs using mechanistic interpretability
techniques. Our analysis reveals consistent patterns across diverse VideoQA
tasks: (1) temporal reasoning begins with active cross-frame interactions in
the early-to-middle layers; (2) this is followed by progressive
video-language integration in the middle layers, facilitated by alignment
between video representations and linguistic embeddings that carry temporal
concepts; (3) once this integration completes, the model is ready to generate
correct answers in the middle-to-late layers; and (4) building on this
analysis, we show that VideoLLMs retain their VideoQA performance when only
these effective information pathways are selected and a substantial fraction
of attention edges is suppressed (e.g., 58% in LLaVA-NeXT-7B-Video-FT). These
findings provide a blueprint for how VideoLLMs perform temporal reasoning and
offer practical insights for improving model interpretability and downstream
generalization.
Our project page with the source code is available at
https://map-the-flow.github.io.
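
To make finding (2) more concrete, below is a minimal sketch of one way such
video-language alignment can be probed: compare a video token's hidden state
against the unembedding directions of temporal-concept words via cosine
similarity. All tensors, token ids, and names here are illustrative stand-ins
under our own assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's code): probe whether a video
# token's hidden state aligns with linguistic embeddings of temporal
# concepts. All tensors below are random stand-ins for real model states.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 64, 1000
unembed = torch.randn(vocab, d_model)   # stand-in for the LM's unembedding matrix
temporal_ids = {"before": 10, "after": 11, "then": 12}  # hypothetical token ids

video_hidden = torch.randn(d_model)     # a video token's hidden state at some layer

# Cosine similarity between the hidden state and each temporal word direction.
for word, idx in temporal_ids.items():
    sim = F.cosine_similarity(video_hidden, unembed[idx], dim=0)
    print(f"{word}: {sim.item():+.3f}")
```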
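
Finding (4) involves suppressing attention edges, i.e., specific
(target, source) entries in a layer's attention map. A common mechanistic
interpretability recipe for this is attention knockout: set the blocked
pre-softmax scores to negative infinity so the target token cannot read from
the source token. The sketch below, with hypothetical names such as
`knockout_attention` and `blocked_edges`, illustrates the idea on a single
attention head; it is a sketch under these assumptions, not the authors'
implementation.

```python
# Minimal sketch of attention knockout on one self-attention head:
# suppress chosen attention edges by setting their pre-softmax scores
# to -inf. Names (knockout_attention, blocked_edges) are hypothetical.
import torch
import torch.nn.functional as F

def knockout_attention(q, k, v, blocked_edges):
    """Single-head attention with selected edges suppressed.

    q, k, v: [seq_len, d] tensors.
    blocked_edges: iterable of (target_pos, source_pos) pairs whose
        attention weight is forced to zero (target may not read source).
    """
    d = q.size(-1)
    scores = q @ k.T / d**0.5                # [seq, seq] pre-softmax scores
    for tgt, src in blocked_edges:
        scores[tgt, src] = float("-inf")     # sever this attention edge
    weights = F.softmax(scores, dim=-1)      # blocked edges get weight 0
    return weights @ v

# Toy usage: block the last (answer) token from attending to the tokens
# of frame 0 (positions 0..3), mimicking a cross-frame knockout.
torch.manual_seed(0)
seq, d = 12, 16
q, k, v = (torch.randn(seq, d) for _ in range(3))
blocked = [(seq - 1, s) for s in range(0, 4)]
out = knockout_attention(q, k, v, blocked)
print(out.shape)  # torch.Size([12, 16])
```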