PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Abstract

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches extending image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern with the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
