X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning

Abstract

Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design has two limitations. First, low-quality text-video pairs can compromise retrieval, yet they are hard to identify and examine. Second, cosine similarity alone provides no explanation for the ranking results, which limits interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM chain-of-thought (CoT) reasoning in place of embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and a complete ranking. X-CoT empirically improves retrieval performance and produces detailed rationales. It also facilitates the analysis of model behavior and data quality. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
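
The contrast between similarity ranking and pairwise-comparison ranking can be illustrated with a minimal sketch. This is not the released X-CoT implementation: the function names, the `prefer` callback (a stand-in for an LLM that judges one pair of candidates and returns a rationale), the simple win-count aggregation, and the toy captions are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec, video_vecs):
    """Baseline: sort video ids by similarity to the query, best first.

    The result is a bare ordering with no explanation attached.
    """
    return sorted(video_vecs,
                  key=lambda vid: cosine(query_vec, video_vecs[vid]),
                  reverse=True)


def rank_by_pairwise(query, candidates, prefer):
    """Rank candidates by counting pairwise wins (an assumed aggregation).

    `prefer(query, a, b)` is a hypothetical judge returning (winner, reason).
    The collected reasons act as the rationale behind the final ranking.
    """
    wins = {c: 0 for c in candidates}
    rationales = []
    for a, b in combinations(candidates, 2):
        winner, reason = prefer(query, a, b)
        wins[winner] += 1
        rationales.append((a, b, winner, reason))
    # sorted() is stable, so tied candidates keep their input order.
    ranking = sorted(candidates, key=lambda c: wins[c], reverse=True)
    return ranking, rationales


# Toy data: captions stand in for the additional video annotations.
captions = {
    "v1": "a man cooking pasta in a kitchen",
    "v2": "a dog running on a beach",
    "v3": "a chef talking about pasta",
}


def toy_prefer(query, a, b):
    """Stand-in judge: prefer the caption sharing more words with the query."""
    q = set(query.lower().split())
    overlap_a = len(q & set(captions[a].split()))
    overlap_b = len(q & set(captions[b].split()))
    winner = a if overlap_a >= overlap_b else b
    return winner, f"{winner} overlaps the query at least as much"


ranking, rationales = rank_by_pairwise("man cooking pasta",
                                       ["v1", "v2", "v3"], toy_prefer)
```

In this sketch the pairwise ranker returns one recorded reason per comparison alongside the ordering, whereas the cosine baseline returns only the ordering. A real system would replace `toy_prefer` with an LLM call and would likely use a more robust aggregation than raw win counts.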
