SiLVR: A Simple Language-based Video Reasoning Framework

Abstract

Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
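The abstract does not spell out how the adaptive token reduction scheme picks its temporal granularity, so the following is only a minimal sketch of one plausible reading: try clip lengths from fine to coarse and keep the finest one whose captions, together with the subtitles, fit a token budget. Everything here is an assumption made for illustration and is not taken from the SiLVR code: the names `choose_granularity`, `build_prompt`, and `Plan`, the candidate clip lengths, the flat `tokens_per_caption` estimate, and the prompt layout.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    clip_seconds: int  # chosen temporal granularity (hypothetical unit: seconds)
    num_clips: int     # number of clips that would be captioned
    est_tokens: int    # estimated caption + subtitle tokens


def choose_granularity(duration_s, subtitle_tokens, budget,
                       tokens_per_caption=100,
                       candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Return the finest clip length whose estimated tokens fit the budget.

    Assumption: every caption costs roughly `tokens_per_caption` tokens.
    If no candidate fits, the coarsest candidate is returned anyway.
    """
    plan = None
    for clip_s in sorted(candidates):
        num_clips = math.ceil(duration_s / clip_s)
        est = num_clips * tokens_per_caption + subtitle_tokens
        plan = Plan(clip_s, num_clips, est)
        if est <= budget:
            return plan
    return plan  # coarsest granularity, still over budget


def build_prompt(captions, subtitles, question):
    """Join timestamped captions and subtitles into one text prompt.

    `captions` is a list of (start_s, end_s, text) tuples and `subtitles`
    is a list of strings; the layout is a hypothetical format.
    """
    lines = ["Clip captions:"]
    lines += [f"[{start}s-{end}s] {text}" for start, end, text in captions]
    lines.append("Subtitles:")
    lines += list(subtitles)
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A 10-minute video with about 2,000 subtitle tokens and a 20,000-token budget.
    print(choose_granularity(duration_s=600, subtitle_tokens=2000, budget=20000))
```

In this sketch, the second stage would amount to sending the string from `build_prompt` to a reasoning LLM of choice. That call is left out because the abstract does not name a specific model or API.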
