Pegasus-v1 Technical Report

Abstract

This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, and to offer nuanced comprehension of videos of various lengths. The report gives an overview of Pegasus-1's architecture and training strategies, and of its performance on benchmarks for video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, to give readers a balanced view of its current state and future direction.
