MM-VID: Advancing Video Understanding with GPT-4V(ision)

Abstract

We present MM-VID, an integrated system that combines the capabilities of GPT-4V with specialized tools in vision, audio, and speech to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses GPT-4V-based video-to-script generation to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres and a range of video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphical user interfaces.
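To make the video-to-script idea concrete, the following is a minimal sketch of how per-clip visual descriptions and speech transcripts might be merged into one timestamped text script that an LLM could then read. It is not the MM-VID implementation: the `Segment` type, the `build_script` and `format_timestamp` helpers, the "visual"/"speech" labels, and the output layout are all assumptions made for illustration, and the calls to GPT-4V and to any speech tool are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One piece of the script (hypothetical structure, not from the paper)."""
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    source: str   # assumed label: "visual" (clip description) or "speech" (dialogue)
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_script(segments: list[Segment]) -> str:
    """Interleave visual and speech segments by start time into one text script."""
    # Sort by start time; ties are broken alphabetically by source label.
    ordered = sorted(segments, key=lambda s: (s.start, s.source))
    lines = [
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] "
        f"{s.source.upper()}: {s.text}"
        for s in ordered
    ]
    return "\n".join(lines)
```

In such a setup, the `text` of a "visual" segment would come from a vision-language model describing a short clip, and the `text` of a "speech" segment from a speech recognizer. The resulting script is plain text, so a text-only LLM can answer questions about an hour-long video without seeing any frames.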
