CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update (Δ) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
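To make the idea of a state-conditioned residual update concrete, the following is a minimal illustrative sketch, not the architecture used in the paper. The abstract does not specify how Δ is computed or what it is added to, so everything here is an assumption: the class name `ResidualStateAdapter`, mean pooling over the history embeddings, the small two-layer network, the zero initialization of the output layer, and the choice to add Δ to the embedding of the next-step query. Embeddings are assumed to come from some frozen vision-language encoder; random vectors stand in for them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class ResidualStateAdapter:
    """Hypothetical adapter that predicts a residual update (delta)
    from visual history; the layout is an assumption for illustration."""

    def __init__(self, dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.02, size=(2 * dim, hidden))
        # Zero-initialized output layer: before any training the adapter
        # leaves the frozen query embedding unchanged (delta == 0).
        self.w2 = np.zeros((hidden, dim))

    def delta(self, history, query):
        # Assumption: summarize the visual history by mean pooling.
        state = history.mean(axis=0)
        hidden = np.tanh(np.concatenate([state, query]) @ self.w1)
        return hidden @ self.w2

    def predict(self, history, query):
        # Residual form: frozen query embedding plus the predicted update.
        return l2_normalize(query + self.delta(history, query))


def rank_candidates(adapter, history, query, candidates):
    """Rank candidate clip embeddings by cosine similarity to the
    context-aware target embedding (best candidate first)."""
    target = adapter.predict(history, query)
    scores = l2_normalize(candidates) @ target
    return np.argsort(-scores), scores


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dim = 8
    history = rng.normal(size=(3, dim))      # embeddings of previous clips
    query = rng.normal(size=dim)             # embedding of the next-step query
    candidates = rng.normal(size=(5, dim))   # embeddings of candidate clips
    order, scores = rank_candidates(ResidualStateAdapter(dim), history, query, candidates)
    print(order, scores[order])
```

With the zero-initialized output layer, the untrained adapter reproduces the zero-shot cosine ranking, and training would move the target embedding away from it only as far as the history warrants. Under the same assumptions, `rank_candidates` could also score generated continuations once they are embedded by the same frozen encoder, which is one way to read the reranking use case mentioned in the abstract.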

