CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update (Δ) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
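
The residual-update idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the adapter architecture, so `toy_delta` is a hypothetical, untrained stand-in that simply pulls the query embedding toward the mean of the history embeddings. The names `rerank`, `toy_delta`, and `weight`, and the three-dimensional toy vectors, are all assumptions made for illustration. In practice the embeddings would come from a frozen vision-language encoder and Δ would be predicted by a learned module.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def toy_delta(history, query, weight=0.5):
    """Hypothetical stand-in for a learned adapter (assumption, not CAST itself):
    the residual points from the query toward the mean history embedding."""
    dim = len(query)
    mean_hist = [sum(h[i] for h in history) / len(history) for i in range(dim)]
    return [weight * (m - q) for m, q in zip(mean_hist, query)]

def rerank(query, history, candidates, weight=0.5):
    """Rank candidate indices by cosine similarity to normalize(query + delta)."""
    delta = toy_delta(history, query, weight)
    target = normalize([q + d for q, d in zip(query, delta)])
    scores = [cosine(target, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data: candidate 0 matches the query best in isolation,
# candidate 1 agrees with both the query and the visual history.
query = [1.0, 0.0, 0.0]
history = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
candidates = [[1.0, 0.0, 0.5], [1.0, 1.0, 0.0]]

# weight=0.0 disables the residual, giving a context-agnostic baseline.
baseline_order, _ = rerank(query, history, candidates, weight=0.0)
context_order, context_scores = rerank(query, history, candidates, weight=0.5)
print(baseline_order, context_order)
```

On this toy data the context-agnostic baseline ranks candidate 0 first, while adding the history-conditioned residual moves candidate 1 to the top. The same scoring function could in principle be applied to embeddings of generated video candidates to rerank them, which is the use the abstract describes for black-box generators.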