Swift Sampling: Selecting Temporal Surprises via Taylor Series

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by predictive coding in the human brain, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. We then apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is extremely lightweight: it adds only 0.02x computational cost over the baseline, an overhead 30x lower than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
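The selection idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the discretization, the distance measure, or the selection rule, so the backward finite differences, the Euclidean norm, the top-k selection, and the names `taylor_surprise` and `select_frames` are all assumptions made for illustration. Per-frame features are assumed to be precomputed by some visual encoder and passed in as a `(T, D)` array.

```python
import numpy as np


def taylor_surprise(features: np.ndarray) -> np.ndarray:
    """Score each frame by its deviation from a second-order Taylor prediction.

    features: array of shape (T, D), one feature vector per frame (assumed
    to come from an arbitrary visual encoder).
    Returns an array of shape (T,); the first three frames get a score of 0
    because they lack enough history for a prediction.
    """
    feats = np.asarray(features, dtype=np.float64)
    num_frames = feats.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores

    # Backward finite differences (an assumed discretization):
    # vel[i] = f[i+1] - f[i], acc[i] = vel[i+1] - vel[i].
    vel = feats[1:] - feats[:-1]
    acc = vel[1:] - vel[:-1]

    # Second-order Taylor step of size 1 from frame t (t = 2 .. T-2):
    # f_hat[t+1] = f[t] + v[t] + 0.5 * a[t],
    # with v[t] = vel[t-1] and a[t] = acc[t-2].
    pred = feats[2:-1] + vel[1:-1] + 0.5 * acc[:-1]

    # Surprise = Euclidean distance between actual and predicted features.
    scores[3:] = np.linalg.norm(feats[3:] - pred, axis=1)
    return scores


def select_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` most surprising frames, in time order."""
    scores = taylor_surprise(features)
    k = min(budget, scores.shape[0])
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


if __name__ == "__main__":
    # Toy trajectory: linear drift in a 2-D feature space with a jump at frame 6.
    toy = np.outer(np.arange(10, dtype=float), [1.0, 2.0])
    toy[6:] += 5.0
    print(select_frames(toy, budget=2))
```

In this sketch a smoothly evolving trajectory yields near-zero scores, while an abrupt change raises the score of the frame where it occurs and, because the velocity and acceleration estimates still contain the jump, of the one or two frames that follow it. A practical variant might suppress such neighbouring picks, but the abstract does not say whether the method does so.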