ST-LLM: Large Language Models Are Effective Temporal Learners

Abstract

Large Language Models (LLMs) have shown impressive capabilities in text comprehension and generation, prompting research on video LLMs to support human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem. In this paper, we investigate a straightforward yet unexplored question: can we feed all spatial-temporal tokens into the LLM, thereby delegating video sequence modeling to the LLM itself? Surprisingly, this simple approach yields significant improvements in video understanding. Building on this, we propose ST-LLM, an effective video-LLM baseline that performs spatial-temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we also design a global-local input module to balance efficiency and effectiveness. Consequently, we harness the LLM for proficient spatial-temporal modeling while maintaining efficiency and stability. Extensive experimental results attest to the effectiveness of our method. With a more concise model and training pipeline, ST-LLM establishes new state-of-the-art results on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
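
The core idea of passing every spatial-temporal token to the LLM, combined with random token masking, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `flatten_video_tokens` and `dynamic_mask`, the frame-major ordering, and the masking-rate range of 0.3 to 0.7 are all assumptions made for illustration, and tokens are stand-in `(frame, position)` tuples rather than real visual embeddings. The tailored training objectives and the global-local input module are not covered here.

```python
import random


def flatten_video_tokens(frames):
    """Concatenate per-frame token lists into one sequence, frame by frame.

    Hypothetical helper: assumes frame-major ordering of the joint
    spatial-temporal sequence.
    """
    return [token for frame in frames for token in frame]


def dynamic_mask(tokens, low=0.3, high=0.7, rng=None):
    """Keep a random subset of tokens, with the masking rate resampled per call.

    The [low, high] range is an illustrative assumption, not a value from
    the paper. Returns the kept tokens and their original indices.
    """
    rng = rng or random.Random()
    rate = rng.uniform(low, high)
    n_keep = max(1, round(len(tokens) * (1.0 - rate)))
    # Sort the sampled indices so the surviving tokens keep their order.
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx], keep_idx


# Toy video: 4 frames with 3 placeholder tokens each.
frames = [[(t, s) for s in range(3)] for t in range(4)]
sequence = flatten_video_tokens(frames)
kept, keep_idx = dynamic_mask(sequence, rng=random.Random(0))
```

In this sketch the LLM would receive `sequence` (all 12 tokens) at inference time and the shorter `kept` sequence during training, which is one plausible way masking could reduce the cost of uncompressed video tokens.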
