MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Abstract

The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to create synthetic, high-quality data for long videos. This framework leverages the power of GPT-4 and text-to-image models to generate detailed scripts and corresponding visuals. Our approach stands out for its flexibility and scalability, making it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.
