Swift Sampling: Selecting Temporal Surprises with Taylor Series

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by predictive coding in the human brain, we introduce Swift Sampling, a simple, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. We then apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising and are selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight: it adds only 0.02x computational cost over the baseline, an overhead 30x lower than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to 12.5 points.
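The selection rule described above can be illustrated with a minimal sketch. It is not the paper's implementation; it rests on several assumptions that the abstract does not specify: per-frame features are already extracted into an array of shape (T, D), velocity and acceleration are approximated by backward finite differences with a unit time step, surprise is measured as the Euclidean distance between the actual and predicted feature, and the frame budget is filled by taking the highest-scoring frames. The function names `taylor_surprise_scores` and `select_frames` are hypothetical.

```python
import numpy as np


def taylor_surprise_scores(features: np.ndarray) -> np.ndarray:
    """Score each frame by its deviation from a second-order Taylor prediction.

    features: (T, D) array of per-frame embeddings (assumed precomputed).
    Returns a (T,) array; the first three frames get score 0 because they
    lack enough history for the finite-difference estimates.
    """
    features = np.asarray(features, dtype=np.float64)
    num_frames = features.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores

    prev1 = features[2:-1]  # f[t-1] for t = 3 .. T-1
    prev2 = features[1:-2]  # f[t-2]
    prev3 = features[:-3]   # f[t-3]

    # Assumed discretization: backward differences with unit time step.
    velocity = prev1 - prev2
    acceleration = prev1 - 2.0 * prev2 + prev3

    # Second-order Taylor step: f[t] ~ f[t-1] + v + 0.5 * a.
    predicted = prev1 + velocity + 0.5 * acceleration

    # Assumed surprise measure: L2 distance between actual and predicted.
    scores[3:] = np.linalg.norm(features[3:] - predicted, axis=1)
    return scores


def select_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` most surprising frames, in time order."""
    scores = taylor_surprise_scores(features)
    budget = min(budget, scores.shape[0])
    top = np.argsort(scores)[::-1][:budget]
    return np.sort(top)


if __name__ == "__main__":
    # Toy trajectory: linear drift in a 2-D latent space with a jump at frame 10.
    t = np.arange(20, dtype=np.float64)[:, None]
    feats = t * np.array([[1.0, 0.5]])
    feats[10:] += 5.0
    print(select_frames(feats, budget=3))
```

In this toy example a smooth linear drift is predicted exactly, so only the frames around the jump receive nonzero scores. Because the finite differences look back three frames, a single abrupt change also perturbs the predictions for the next two frames; a practical variant might suppress such neighbors, which is a design choice outside what the abstract states.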