ST-LLM: Large Language Models Are Effective Temporal Learners

Abstract

Large Language Models (LLMs) have shown impressive capabilities in text comprehension and generation, prompting research into video LLMs that facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem. In this paper, we investigate a straightforward yet unexplored question: can we feed all spatial-temporal tokens into the LLM, thereby delegating the task of video sequence modeling to the LLM itself? Surprisingly, this simple approach yields significant improvements in video understanding. Building on this, we propose ST-LLM, an effective video-LLM baseline that performs Spatial-Temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within the LLM, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we also design a global-local input module to balance efficiency and effectiveness. As a result, we harness the LLM for proficient spatial-temporal modeling while maintaining efficiency and stability. Extensive experimental results demonstrate the effectiveness of our method. With a more concise model and training pipeline, ST-LLM establishes new state-of-the-art results on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
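As a rough illustration of the two ideas named in the abstract (feeding every spatial-temporal token to the LLM, then masking some of them during training), the following minimal sketch flattens per-frame tokens into one sequence and drops a randomly sized subset. It is not the authors' implementation: the function names, the 8 x 32 token layout, and the 0.3 to 0.7 masking range are all hypothetical choices made for this example, and the tailor-made training objectives are not modeled.

```python
import random


def flatten_video_tokens(frames):
    """Concatenate per-frame token lists into one spatial-temporal sequence."""
    return [tok for frame in frames for tok in frame]


def dynamic_mask(tokens, low=0.3, high=0.7, rng=None):
    """Keep a random subset of tokens, preserving their original order.

    The masking rate is resampled on every call (assumed range: low..high),
    so the sequence length seen by the model varies between training steps.
    """
    rng = rng or random.Random()
    rate = rng.uniform(low, high)
    keep = max(1, round(len(tokens) * (1 - rate)))
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx], rate


# Hypothetical layout: 8 frames x 32 tokens, each token labelled (frame, position).
frames = [[(t, p) for p in range(32)] for t in range(8)]
sequence = flatten_video_tokens(frames)   # all 256 tokens, no pooling or compression
masked, rate = dynamic_mask(sequence, rng=random.Random(0))
print(len(sequence), len(masked))
```

The unmasked sequence grows linearly with the number of frames, which is the overhead the abstract refers to; masking shortens the sequence the LLM processes at each training step.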
