Reasoning-Augmented Representations for Multimodal Retrieval

Abstract

Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong vision-language model (VLM), we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
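
The three enhancement steps named in the abstract (dense captioning of corpus entries, reference resolution in queries, and instruction rewriting) can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' released code: the `VLM` callable type, the `CorpusEntry` and `Query` classes, the prompt strings, the `[caption]` separator, and `stub_vlm` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical VLM interface: (prompt, image reference or None) -> generated text.
VLM = Callable[[str, Optional[str]], str]


@dataclass
class CorpusEntry:
    text: str
    image: Optional[str] = None  # path or ID of the image, if any


@dataclass
class Query:
    instruction: str
    text: str
    image: Optional[str] = None


def enhance_corpus_entry(entry: CorpusEntry, vlm: VLM) -> str:
    """Append a dense caption so visual evidence becomes explicit text."""
    if entry.image is None:
        return entry.text
    caption = vlm("Describe all visible entities, attributes, and text.", entry.image)
    return f"{entry.text} [caption] {caption}".strip()


def enhance_query(query: Query, vlm: VLM) -> str:
    """Resolve references against the query image, then compress the instruction."""
    resolved = query.text
    if query.image is not None:
        resolved = vlm(f"Rewrite so every reference is explicit: {query.text}", query.image)
    constraint = vlm(f"Rewrite as a concise retrieval constraint: {query.instruction}", None)
    return f"{constraint} {resolved}".strip()


def stub_vlm(prompt: str, image: Optional[str]) -> str:
    """Stand-in for a real VLM call so the sketch runs offline."""
    return f"<vlm:{image or 'text'}>"


if __name__ == "__main__":
    entry = CorpusEntry(text="Tower at night", image="img_001.jpg")
    query = Query(instruction="Find a similar photo", text="same one but in daylight", image="img_002.jpg")
    print(enhance_corpus_entry(entry, stub_vlm))
    print(enhance_query(query, stub_vlm))
```

Under the abstract's claim that inference-time enhancement alone is insufficient, the same two functions would be applied to the training queries and corpus as well as at inference, so the retriever is trained on the inputs it later sees.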
