PAVE: Patching and Adapting Video Large Language Models
March 25, 2025
Authors: Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
cs.AI
Abstract
Pre-trained video large language models (Video LLMs) exhibit remarkable
reasoning capabilities, yet adapting these models to new tasks involving
additional modalities or data types (e.g., audio or 3D information) remains
challenging. In this paper, we present PAVE, a flexible framework for adapting
pre-trained Video LLMs to downstream tasks with side-channel signals, such as
audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters,
referred to as "patches," which add a small number of parameters and operations
to a base model without modifying its architecture or pre-trained weights. In
doing so, PAVE can effectively adapt the pre-trained base model to support
diverse downstream tasks, including audio-visual question answering, 3D
reasoning, multi-view video recognition, and high frame rate video
understanding. Across these tasks, PAVE significantly enhances the performance
of the base model, surpassing state-of-the-art task-specific models while
incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE
supports multi-task learning and generalizes well across different Video LLMs.
Our code is available at https://github.com/dragonlzm/PAVE.
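To make the "~0.1% additional parameters" claim concrete, here is a back-of-the-envelope sketch of how a lightweight adapter's parameter count compares to a large base model. All layer sizes below (hidden size, side-channel feature dimension, the two-layer adapter shape) are hypothetical illustrations, not PAVE's actual design; the point is simply that adapter parameters stay orders of magnitude below the base model's.

```python
# Back-of-the-envelope estimate of the parameter overhead of a small
# adapter ("patch") relative to a multi-billion-parameter base Video LLM.
# All sizes here are hypothetical, chosen only for illustration.

def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a single fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

base_params = 7_000_000_000  # e.g., a 7B-parameter base model

# A toy adapter: project a side-channel feature (e.g., an audio embedding)
# into the LLM's hidden size, then apply one small fusion layer.
hidden = 4096    # hypothetical LLM hidden size
side_dim = 768   # hypothetical side-channel feature size
adapter_params = linear_params(side_dim, hidden) + linear_params(hidden, hidden)

overhead = adapter_params / base_params
print(f"adapter params: {adapter_params:,}")        # ~20M parameters
print(f"relative overhead: {overhead:.4%}")         # a fraction of a percent
```

Even this deliberately generous two-layer adapter adds well under 1% of the base model's parameters; a bottlenecked (low-rank) adapter would add far less, which is consistent with the sub-percent overhead the paper reports.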