SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
July 22, 2024
Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
cs.AI
Abstract
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by a two-stream SlowFast design of inputs for Video LLMs, which aggregates features from sampled video frames effectively. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves performance comparable to, or even better than, that of state-of-the-art Video LLMs fine-tuned on video datasets.
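
To make the token-budget arithmetic concrete, here is a minimal PyTorch sketch of a two-stream aggregation in the spirit of the abstract. The function name, the frame count, the uniform temporal striding, and the use of average pooling are illustrative assumptions, not the paper's actual implementation; only the 24x24 token resolution and the 6x pooling stride come from the text above.

```python
# Sketch of SlowFast-style visual token aggregation (assumptions noted above;
# not the authors' implementation).
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    slow_stride: int = 4,
                    fast_pool: int = 6) -> torch.Tensor:
    """Aggregate per-frame visual tokens into one LLM input sequence.

    frame_features: (T, H, W, C) features of T sampled frames,
                    e.g. T frames of 24x24 tokens each.
    slow_stride:    temporal stride of the Slow pathway (low frame rate) --
                    an assumed hyperparameter for this sketch.
    fast_pool:      spatial pooling stride of the Fast pathway
                    (e.g. 6x, turning 24x24 tokens into 4x4 per frame).
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: few frames, full spatial resolution (H x W tokens each).
    slow = frame_features[::slow_stride]              # (T/slow_stride, H, W, C)
    slow = slow.reshape(-1, C)                        # flatten to a token list

    # Fast pathway: all frames, aggressively pooled so motion cues stay cheap.
    fast = frame_features.permute(0, 3, 1, 2)         # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)  # (T, C, H/p, W/p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)    # flatten to a token list

    # Concatenate both streams into the visual token sequence fed to the LLM.
    return torch.cat([slow, fast], dim=0)

# Example with assumed sizes: 24 frames of 24x24 tokens yields
# 6*576 (Slow) + 24*16 (Fast) = 3840 visual tokens, well under
# typical LLM context budgets.
tokens = slowfast_tokens(torch.randn(24, 24, 24, 1024))
print(tokens.shape)  # torch.Size([3840, 1024])
```

The split illustrates the trade-off the abstract describes: the Slow stream spends its token budget on spatial detail from a few frames, while the Fast stream covers many frames at a small per-frame cost to preserve temporal context.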