SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

July 22, 2024
Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
cs.AI

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames effectively. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues. As a result, this design allows us to adequately capture both the spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves performance comparable to, or even better than, state-of-the-art Video LLMs that are fine-tuned on video datasets.
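To make the two-stream design concrete, here is a minimal PyTorch sketch of how Slow and Fast pathways might pool and concatenate frame tokens. Only the 24x24 token grid and the 6x pooling stride come from the abstract; the frame counts, the `slowfast_aggregate` helper, and its signature are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def slowfast_aggregate(frame_features: torch.Tensor,
                       slow_stride: int = 8,
                       fast_pool: int = 6) -> torch.Tensor:
    """Aggregate sampled-frame features with a two-stream SlowFast design.

    frame_features: (T, H, W, C) visual tokens for T sampled frames,
    e.g. T=48 frames of 24x24 tokens each (illustrative values).
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: a few frames at full spatial resolution, preserving
    # spatial detail (every `slow_stride`-th frame keeps all 24x24 tokens).
    slow = frame_features[::slow_stride]                   # (T/8, H, W, C)
    slow_tokens = slow.reshape(-1, C)                      # (T/8 * 576, C)

    # Fast pathway: every frame, but aggressively pooled in space
    # (6x downsampling leaves 4x4 tokens per frame, keeping motion cues).
    fast = frame_features.permute(0, 3, 1, 2)              # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)       # (T, C, H/6, W/6)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)  # (T * 16, C)

    # Concatenate both streams into a single token sequence for the LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)
```

With these illustrative numbers (48 sampled frames), the Slow stream contributes 6 x 576 = 3,456 tokens and the Fast stream 48 x 16 = 768, so the combined sequence stays far below the 48 x 576 = 27,648 tokens that feeding every frame at full resolution would require, which is how the design respects a typical LLM token budget.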
