Pegasus-v1 Technical Report

Abstract

This technical report introduces Pegasus-1, a multimodal language model specialized in understanding video content and interacting with it through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, and to offer nuanced comprehension of video content across various lengths. The report gives an overview of Pegasus-1's architecture, its training strategies, and its performance on benchmarks for video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, to give readers a balanced view of its current state and future direction.
