CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update (Δ) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
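
The residual-update idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that the adapter is a single linear layer (the hypothetical `predict_delta`), that the visual history is mean-pooled, that Δ is added to the query embedding, and that candidates are ranked by cosine similarity. The abstract specifies none of these details; all embeddings are stand-ins for the outputs of a frozen vision-language encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def mean_pool(history):
    """Average the embeddings of previously selected clips (assumed pooling)."""
    dim = len(history[0])
    return [sum(h[i] for h in history) / len(history) for i in range(dim)]


def predict_delta(weights, query, history):
    """Hypothetical adapter: one linear layer over [query ; pooled history]."""
    features = list(query) + mean_pool(history)
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def rank_candidates(weights, query, history, candidates):
    """Rank candidate clip embeddings against the state-updated query."""
    delta = predict_delta(weights, query, history)
    target = [q + d for q, d in zip(query, delta)]  # residual update (assumed form)
    scores = [cosine(target, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 2-D example: the weights simply copy the pooled history into delta.
weights = [[0, 0, 1, 0], [0, 0, 0, 1]]
query = [1.0, 0.0]          # stand-in text embedding of the next step
history = [[0.0, 1.0]]      # stand-in embedding of the clip chosen so far
candidates = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
ranking = rank_candidates(weights, query, history, candidates)
```

With all-zero weights, Δ vanishes and the ranking reduces to context-agnostic query-to-clip similarity, which is the baseline behavior the abstract contrasts against. The same scoring function could, under these assumptions, be applied to rerank generated candidates rather than retrieved ones.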
