PAVE: Patching and Adapting Video Large Language Models

Abstract

Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting them to new tasks that involve additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly improves the performance of the base model and surpasses state-of-the-art task-specific models, at a cost of only ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
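To make the ~0.1% overhead figure concrete, the following is a minimal arithmetic sketch of how a patch's size compares with a frozen base model. The parameter counts and the helper name `overhead_fraction` are hypothetical, chosen only for illustration; they are not taken from the paper or its code.

```python
def overhead_fraction(base_params: int, patch_params: int) -> float:
    """Return the patch size as a fraction of the frozen base model's size."""
    if base_params <= 0:
        raise ValueError("base_params must be positive")
    return patch_params / base_params


# Hypothetical sizes, assumed for illustration only (not from the paper).
BASE_PARAMS = 7_000_000_000  # assumed frozen base Video LLM
PATCH_PARAMS = 7_000_000     # assumed trainable patch

fraction = overhead_fraction(BASE_PARAMS, PATCH_PARAMS)
print(f"patch overhead: {fraction:.2%}")  # prints "patch overhead: 0.10%"
```

Under these assumed sizes, only the patch parameters would be trained and stored per task, while the base model's weights stay untouched and can be shared across tasks.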
