Pegasus-v1 Technical Report

Abstract

This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, and to offer nuanced comprehension of videos of various lengths. The report gives an overview of Pegasus-1's architecture, its training strategies, and its performance on benchmarks for video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, to give readers a balanced view of its current state and future direction.
