Koala: Key frame-conditioned long video-LLM

Abstract

Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
