Swift Sampling: Selecting Temporal Surprises via Taylor Series

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. We then apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight: it adds only 0.02x additional computational cost over the baseline, an overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
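The selection idea described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not a detail confirmed by the abstract: per-frame features are taken as a precomputed (T, D) array from some visual encoder, velocity and acceleration are approximated by backward finite differences with a unit time step, the prediction is a second-order Taylor step f[t+1] ≈ f[t] + v[t] + 0.5·a[t], surprise is the L2 distance between the actual and predicted feature, and frames are chosen by a simple top-k rule. The function names `taylor_surprise` and `select_frames` are hypothetical.

```python
import numpy as np


def taylor_surprise(features: np.ndarray) -> np.ndarray:
    """Score each frame by how far its feature lies from a Taylor prediction.

    features: (T, D) array, one feature vector per frame (assumed precomputed).
    Returns a (T,) array; the first three frames score 0 because the
    second-order prediction needs three preceding frames.
    """
    f = np.asarray(features, dtype=np.float64)
    if f.ndim != 2:
        raise ValueError("features must have shape (T, D)")
    num_frames = f.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores

    # Backward differences with unit time step (an assumption of this sketch).
    v = f[1:] - f[:-1]   # v[i] is the velocity at frame i + 1
    a = v[1:] - v[:-1]   # a[i] is the acceleration at frame i + 2

    # Second-order Taylor step from frame t to t + 1, for t = 2 .. T - 2.
    pred = f[2:-1] + v[1:-1] + 0.5 * a[:-1]

    # Surprise of frame t + 1: distance between actual and predicted feature.
    scores[3:] = np.linalg.norm(f[3:] - pred, axis=1)
    return scores


def select_frames(scores: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` highest-scoring frames, in time order."""
    budget = max(0, min(budget, scores.shape[0]))
    top = np.argsort(-scores, kind="stable")[:budget]
    return np.sort(top)


if __name__ == "__main__":
    # Synthetic trajectory: smooth linear drift with an abrupt offset at frame 10.
    t = np.arange(20, dtype=float)[:, None]
    feats = t * np.array([1.0, 0.5, -0.2, 0.0])
    feats[10:] += np.array([3.0, 0.0, 0.0, 4.0])
    scores = taylor_surprise(feats)
    print(select_frames(scores, budget=3))
```

On this synthetic trajectory the scores are zero along the linear drift and nonzero only around the abrupt change. Note that with backward differences the frames immediately after a jump also score highly, because the jump contaminates the velocity and acceleration estimates used for the next predictions; a practical selector would likely need to account for this, and the abstract does not say how the actual method does so.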