Swift Sampling: Temporal Surprise Selection via Taylor Series

Abstract

Most frames in long-form video are redundant; the critical information resides in temporal surprises, moments where the actual visual features deviate from their predicted evolution. Inspired by predictive coding in the human brain, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. We then apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is extremely lightweight: it adds only 0.02x additional computational cost over the baseline model, an overhead 30x lower than that of leading methods. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, where it improves accuracy by up to 12.5 points.
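To make the idea concrete, the following is a minimal sketch of Taylor-based surprise scoring, not the authors' implementation. The abstract does not specify the expansion order, the finite-difference scheme, the distance measure, or the selection rule, so everything here is an assumption: the function names `temporal_surprise` and `select_frames` are hypothetical, velocity and acceleration are estimated with backward finite differences over per-frame feature vectors, the prediction is a second-order Taylor step, the deviation is a Euclidean norm, and frames are picked by a simple top-k rule.

```python
import numpy as np


def temporal_surprise(features: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Score each frame by how far it lands from a Taylor prediction.

    features: array of shape (T, D), one latent feature vector per frame.
    Returns an array of shape (T,). The first three frames get score 0
    because three past frames are needed to estimate acceleration.
    (Assumed scheme: backward differences, second-order expansion.)
    """
    f = np.asarray(features, dtype=np.float64)
    num_frames = f.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores

    prev1 = f[2:-1]  # f[t-1] for t = 3 .. T-1
    prev2 = f[1:-2]  # f[t-2]
    prev3 = f[:-3]   # f[t-3]

    velocity = (prev1 - prev2) / dt
    acceleration = (prev1 - 2.0 * prev2 + prev3) / dt**2

    # Second-order Taylor step from frame t-1 to frame t.
    predicted = prev1 + velocity * dt + 0.5 * acceleration * dt**2

    # Surprise = Euclidean distance between actual and predicted features.
    scores[3:] = np.linalg.norm(f[3:] - predicted, axis=1)
    return scores


def select_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most surprising frames, in time order.

    Top-k selection is an assumed rule; the abstract does not state one.
    """
    scores = temporal_surprise(features)
    k = min(budget, scores.shape[0])
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)
```

In this sketch only past frames enter the prediction, so scoring could in principle run in a single streaming pass over precomputed features. One side effect of that choice is that an abrupt change also raises the scores of the next few frames, since the disturbed frame feeds into later velocity and acceleration estimates; a practical selector would likely need some form of temporal suppression, which is left out here.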