SiLVR: A Simple Language-based Video Reasoning Framework

May 30, 2025
Authors: Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius
cs.AI

Abstract

Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
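
The abstract describes the adaptive token reduction scheme only at a high level, so the following is a minimal sketch of one way such a scheme could work, not the authors' implementation. It assumes a fixed token cost per clip caption, a hypothetical list of candidate clip lengths, and a hypothetical token budget for the reasoning LLM. The sketch picks the finest temporal granularity whose captions and subtitles still fit the budget, then assembles a plain-text prompt. All function names and parameters are illustrative.

```python
import math


def estimate_tokens(duration_s, clip_len_s, tokens_per_caption, subtitle_tokens):
    """Estimate prompt size when the video is captioned in clips of clip_len_s seconds."""
    num_clips = math.ceil(duration_s / clip_len_s)
    return num_clips * tokens_per_caption + subtitle_tokens


def choose_clip_length(duration_s, tokens_per_caption, subtitle_tokens,
                       token_budget, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Return the shortest candidate clip length (finest granularity) that fits the budget.

    The candidate lengths and the budget are assumptions made for this sketch.
    If no candidate fits, fall back to the coarsest one.
    """
    for clip_len_s in sorted(candidates):
        total = estimate_tokens(duration_s, clip_len_s,
                                tokens_per_caption, subtitle_tokens)
        if total <= token_budget:
            return clip_len_s
    return max(candidates)


def build_prompt(captions, subtitles, question):
    """Stage two input: merge timestamped captions and subtitles into one text prompt.

    captions is a list of (start_second, caption_text) pairs.
    """
    lines = ["Video captions:"]
    lines += [f"[{start}s] {text}" for start, text in captions]
    lines += ["Subtitles:", subtitles, f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # A one-hour video needs coarser clips than a one-minute video.
    print(choose_clip_length(3600, 100, 10_000, 60_000))
    print(choose_clip_length(60, 100, 10_000, 60_000))
```

Under these assumptions, longer videos are sampled more coarsely and shorter ones more finely, which is the behavior the abstract attributes to dynamically choosing the temporal granularity. The actual criterion used in SiLVR may differ; see the linked repository for the real code.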
