
X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning

September 25, 2025
作者: Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Zhiqiang Tao
cs.AI

Abstract

Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design has two limitations. First, low-quality text-video data pairs can compromise retrieval, yet they are hard to identify and examine. Second, cosine similarity alone provides no explanation for the ranking results, limiting interpretability. We therefore ask: can we interpret the ranking results so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM chain-of-thought (CoT) reasoning, in place of embedding-based similarity ranking. We first expand existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and a complete ranking. X-CoT empirically improves retrieval performance and produces detailed rationales. It also facilitates model behavior analysis and data quality assessment. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
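To illustrate the contrast between the two ranking schemes described above, here is a minimal Python sketch. It is not the authors' implementation: llm_prefers is a hypothetical placeholder for the LLM pairwise-comparison step (which in X-CoT would also emit a reasoning trace), and using a comparison sort to turn pairwise judgments into a complete ranking is only one possible strategy.

```python
import numpy as np
from functools import cmp_to_key

def cosine_rank(query_emb, video_embs):
    """Baseline: rank candidate videos by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q
    return np.argsort(-sims).tolist()  # candidate indices, most similar first

def llm_prefers(query_text, caption_a, caption_b):
    """Hypothetical placeholder for the LLM comparison step: decide whether
    candidate A matches the query better than candidate B. A real system would
    prompt an LLM and also collect its chain-of-thought rationale."""
    # Toy heuristic so the sketch runs without an LLM: prefer the caption that
    # shares more words with the query text.
    overlap = lambda c: len(set(query_text.lower().split()) & set(c.lower().split()))
    return overlap(caption_a) >= overlap(caption_b)

def pairwise_cot_rank(query_text, captions):
    """Rank candidates via pairwise comparisons instead of a single similarity score."""
    def cmp(i, j):
        return -1 if llm_prefers(query_text, captions[i], captions[j]) else 1
    return sorted(range(len(captions)), key=cmp_to_key(cmp))

# Toy usage with made-up captions standing in for video annotations.
query = "a dog catches a frisbee on the beach"
captions = ["a dog playing with a frisbee on the sand", "a person cooking pasta indoors"]
print(pairwise_cot_rank(query, captions))  # e.g. [0, 1]
```

Any comparison-based sort can aggregate pairwise LLM judgments into a full ordering; the actual CoT prompting and ranking procedure used by X-CoT is described in the paper and the repository linked above.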