X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning

September 25, 2025
作者: Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Zhiqiang Tao
cs.AI

Abstract

Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. First, low-quality text-video data pairs can compromise retrieval, yet they are hard to identify and examine. Second, cosine similarity alone provides no explanation for the ranking results, limiting interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built on LLM chain-of-thought (CoT) reasoning in place of embedding-based similarity ranking. We first expand existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and a complete ranking. X-CoT empirically improves retrieval performance and produces detailed rationales. It also facilitates analysis of model behavior and data quality. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
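
Below is a minimal Python sketch contrasting the conventional cosine-similarity ranking with an LLM-judged pairwise-comparison ranking of the kind the abstract describes. The `Video` dataclass, the `llm_prefers_a` judge, and the comparison-sort reranking are illustrative assumptions for exposition, not the authors' implementation.

```python
# Sketch: cosine-similarity ranking vs. LLM-judged pairwise ranking.
# The LLM judge (`llm_prefers_a`) and the caption field are illustrative
# assumptions, not the X-CoT implementation.
from dataclasses import dataclass
from functools import cmp_to_key
from typing import List

import numpy as np


@dataclass
class Video:
    video_id: str
    caption: str            # textual annotation describing the video
    embedding: np.ndarray   # feature vector from an embedding model


def cosine_rank(query_emb: np.ndarray, videos: List[Video]) -> List[Video]:
    """Baseline: rank videos by cosine similarity to the query embedding."""
    def cos(v: Video) -> float:
        denom = np.linalg.norm(query_emb) * np.linalg.norm(v.embedding) + 1e-8
        return float(query_emb @ v.embedding / denom)
    return sorted(videos, key=cos, reverse=True)


def llm_prefers_a(query: str, a: Video, b: Video) -> bool:
    """Hypothetical LLM judge: return True if video `a` matches the query
    better than video `b`, based on their textual annotations. A real system
    would prompt an LLM to reason step by step and record its rationale."""
    raise NotImplementedError("plug in an LLM call here")


def pairwise_cot_rank(query: str, videos: List[Video]) -> List[Video]:
    """Rank candidates via pairwise LLM comparisons (comparison-sort style),
    so every ordering decision comes with an explicit rationale."""
    def cmp(a: Video, b: Video) -> int:
        return -1 if llm_prefers_a(query, a, b) else 1
    return sorted(videos, key=cmp_to_key(cmp))
```

Pairwise LLM comparisons cost far more than a dot product (a comparison sort issues on the order of n log n LLM calls), so reasoning-based ranking of this kind is typically applied to a short candidate list; how X-CoT structures its comparison steps is detailed in the paper itself.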