
Pretraining Frame Preservation in Autoregressive Video Memory Compression

December 29, 2025
作者: Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala
cs.AI

Abstract

We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
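The context-budget claim above can be made concrete with some back-of-the-envelope arithmetic. The frame rate and per-frame token count below are illustrative assumptions (not stated in the abstract); only the 20-second duration and the ~5k compressed context length come from the text.

```python
# Hypothetical context-budget arithmetic for PFP-style memory compression.
# FPS and TOKENS_PER_FRAME are assumed for illustration, not from the paper.
FPS = 30                      # assumed frame rate
SECONDS = 20                  # video length from the abstract
TOKENS_PER_FRAME = 256        # assumed latent tokens per frame
COMPRESSED_CONTEXT = 5_000    # "about 5k length" from the abstract

num_frames = FPS * SECONDS                    # 600 frames
uncompressed = num_frames * TOKENS_PER_FRAME  # 153,600 tokens without compression
ratio = uncompressed / COMPRESSED_CONTEXT     # effective compression factor

print(num_frames, uncompressed, round(ratio, 1))  # 600 153600 30.7
```

Under these assumptions the memory encoder would shrink the history by roughly 30x, which is what makes storing long-horizon memory at low context cost plausible.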