시간적 동적 컨텍스트 기반의 멀티모달 장영상 모델링

초록

대규모 언어 모델(LLMs)의 최근 발전은 비디오 이해 분야에서 상당한 돌파구를 마련했습니다. 그러나 기존 모델들은 LLM의 컨텍스트 길이 제약과 비디오 내 방대한 정보량으로 인해 긴 비디오 처리에 어려움을 겪고 있습니다. 최근 일부 방법론들이 긴 비디오 이해를 위해 설계되었지만, 토큰 압축 과정에서 중요한 정보를 잃거나 오디오와 같은 추가 모달리티를 처리하는 데 어려움을 겪는 경우가 많습니다. 본 연구에서는 프레임 간의 시간적 관계를 활용한 동적 긴 비디오 인코딩 방법인 Temporal Dynamic Context(TDC)를 제안합니다. 첫째, 프레임 간 유사성을 기반으로 비디오를 의미론적으로 일관된 장면으로 분할한 후, 비주얼-오디오 인코더를 사용하여 각 프레임을 토큰으로 인코딩합니다. 둘째, 각 세그먼트 내 토큰 수를 줄이기 위한 새로운 시간적 컨텍스트 압축기를 제안합니다. 구체적으로, 쿼리 기반 Transformer를 사용하여 비디오, 오디오, 명령어 텍스트 토큰을 제한된 수의 시간적 컨텍스트 토큰으로 집계합니다. 마지막으로, 정적 프레임 토큰과 시간적 컨텍스트 토큰을 LLM에 입력하여 비디오 이해를 수행합니다. 또한, 극단적으로 긴 비디오를 처리하기 위해 훈련이 필요 없는 사고의 연쇄(chain-of-thought) 전략을 제안합니다. 이 전략은 여러 비디오 세그먼트에서 점진적으로 답을 추출하며, 이러한 중간 답변은 추론 과정의 일부로 작용하여 최종 답변에 기여합니다. 일반 비디오 이해 및 오디오-비디오 이해 벤치마크에서 광범위한 실험을 수행한 결과, 우리의 방법이 강력한 성능을 보였습니다. 코드와 모델은 https://github.com/Hoar012/TDC-Video에서 확인할 수 있습니다.

English

Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.

시간적 동적 컨텍스트 기반의 멀티모달 장영상 모델링

Multimodal Long Video Modeling Based on Temporal Dynamic Context

초록

Support