ST-LLM: Large Language Models Are Effective Temporal Learners

Abstract

Large Language Models (LLMs) have shown impressive capabilities in text comprehension and generation, prompting research on video LLMs that support human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem. In this paper, we investigate a straightforward yet unexplored question: can we feed all spatial-temporal tokens into the LLM, thereby delegating the task of video sequence modeling to the LLM itself? Surprisingly, this simple approach yields significant improvements in video understanding. Building on this, we propose ST-LLM, an effective video-LLM baseline that performs Spatial-Temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we also design a global-local input module to balance efficiency and effectiveness. As a result, we harness the LLM for proficient spatial-temporal modeling while maintaining efficiency and stability. Extensive experimental results confirm the effectiveness of our method. With a more concise model and training pipeline, ST-LLM establishes new state-of-the-art results on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
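The abstract's two core ideas, passing every spatial-temporal token to the LLM and randomly masking some of them during training, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `flatten_video_tokens` and `dynamic_mask`, the masking-ratio range, and the tuple "tokens" are all hypothetical, chosen only to show the general shape of the technique.

```python
import random


def flatten_video_tokens(frames):
    """Concatenate per-frame token lists into one spatial-temporal sequence."""
    return [tok for frame in frames for tok in frame]


def dynamic_mask(tokens, min_ratio=0.3, max_ratio=0.7, rng=None):
    """Drop a randomly sized fraction of tokens while preserving their order.

    The ratio range is an illustrative assumption, not a value from the paper.
    """
    if rng is None:
        rng = random.Random()
    # A new masking ratio is drawn on every call, hence "dynamic".
    ratio = rng.uniform(min_ratio, max_ratio)
    n_keep = max(1, round(len(tokens) * (1.0 - ratio)))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx], ratio


# Toy video: 4 frames x 6 patch tokens, each token a (frame, patch) tuple.
frames = [[(t, p) for p in range(6)] for t in range(4)]
sequence = flatten_video_tokens(frames)
kept, ratio = dynamic_mask(sequence, rng=random.Random(0))
print(len(sequence), len(kept), round(ratio, 2))
```

In a real model the tokens would be visual embeddings rather than tuples, and the shortened sequence would be what the LLM processes during training; the sketch only shows why masking reduces sequence length and therefore overhead.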
