Pegasus-v1 Technical Report

Abstract

This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the challenges specific to video data, such as interpreting spatiotemporal information, and to offer nuanced comprehension of video content of various lengths. The report gives an overview of Pegasus-1's architecture, its training strategies, and its benchmark performance on video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, to give readers a balanced view of its current state and future direction.
