SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
July 22, 2024
Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
cs.AI
Abstract
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by a two-stream SlowFast design of inputs for Video LLMs, which aggregates features from sampled video frames effectively. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves performance comparable to, or even better than, that of state-of-the-art Video LLMs fine-tuned on video datasets.
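
To make the token-budget arithmetic concrete, here is a minimal PyTorch sketch of a two-stream aggregation in the spirit of the abstract. The function name, the frame count, the uniform temporal striding, and the use of average pooling are illustrative assumptions, not the paper's actual implementation; only the 24x24 token resolution and the 6x pooling stride come from the text above.

```python
# Sketch of SlowFast-style visual token aggregation (assumptions noted above;
# not the authors' implementation).
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    slow_stride: int = 4,
                    fast_pool: int = 6) -> torch.Tensor:
    """Aggregate per-frame visual tokens into one LLM input sequence.

    frame_features: (T, H, W, C) features of T sampled frames,
                    e.g. T frames of 24x24 tokens each.
    slow_stride:    temporal stride of the Slow pathway (low frame rate) --
                    an assumed hyperparameter for this sketch.
    fast_pool:      spatial pooling stride of the Fast pathway
                    (e.g. 6x, turning 24x24 tokens into 4x4 per frame).
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: few frames, full spatial resolution (H x W tokens each).
    slow = frame_features[::slow_stride]              # (T/slow_stride, H, W, C)
    slow = slow.reshape(-1, C)                        # flatten to a token list

    # Fast pathway: all frames, aggressively pooled so motion cues stay cheap.
    fast = frame_features.permute(0, 3, 1, 2)         # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)  # (T, C, H/p, W/p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)    # flatten to a token list

    # Concatenate both streams into the visual token sequence fed to the LLM.
    return torch.cat([slow, fast], dim=0)

# Example with assumed sizes: 24 frames of 24x24 tokens yields
# 6*576 (Slow) + 24*16 (Fast) = 3840 visual tokens, well under
# typical LLM context budgets.
tokens = slowfast_tokens(torch.randn(24, 24, 24, 1024))
print(tokens.shape)  # torch.Size([3840, 1024])
```

The split illustrates the trade-off the abstract describes: the Slow stream spends its token budget on spatial detail from a few frames, while the Fast stream covers many frames at a small per-frame cost to preserve temporal context.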