
ST-LLM: Large Language Models Are Effective Temporal Learners

March 30, 2024
Authors: Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li
cs.AI

Abstract

Large Language Models (LLMs) have shown impressive capabilities in text comprehension and generation, prompting research into video LLMs that support human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem. In this paper, we investigate a straightforward yet unexplored question: can we feed all spatial-temporal tokens into the LLM, thereby delegating video sequence modeling to the LLM itself? Surprisingly, this simple approach yields significant improvements in video understanding. Building on this, we propose ST-LLM, an effective video-LLM baseline that performs Spatial-Temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we also design a global-local input module to balance efficiency and effectiveness. As a result, we harness the LLM for proficient spatial-temporal modeling while maintaining efficiency and stability. Extensive experimental results attest to the effectiveness of our method. With a more concise model and training pipeline, ST-LLM establishes new state-of-the-art results on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
