VideoPrism: A Foundational Visual Encoder for Video Understanding

Abstract

We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding through global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to computer vision for science, achieving state-of-the-art performance on 30 out of 33 video understanding benchmarks.

