MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

Abstract

The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a carefully constructed pipeline, MMSearch-Engine, to equip any LMM with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive benchmark for assessing the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields and has no overlap with the training data of current LMMs, ensuring that the correct answer can only be obtained through searching. Using MMSearch-Engine, the LMMs are evaluated on three individual tasks (requery, rerank, and summarization) and on one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, surpassing the commercial product Perplexity Pro in the end-to-end task and demonstrating the effectiveness of our proposed pipeline. We further present an error analysis revealing that current LMMs still struggle to fully grasp multimodal search tasks, and we conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engines. Project Page: https://mmsearch.github.io
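To make the three individual tasks concrete, the following is a minimal sketch of how requery, rerank, and summarization might be chained into one end-to-end search round. It is an illustration under stated assumptions, not the MMSearch-Engine implementation: the abstract does not specify any interfaces, so `Page`, `run_search_round`, and the `toy_*` stand-ins are all hypothetical names, and image inputs are omitted for brevity. In a real system each callable would presumably be backed by an LMM call or a web search API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Page:
    """Hypothetical container for one retrieved web page (text only here)."""
    title: str
    text: str


def run_search_round(
    query: str,
    requery: Callable[[str], str],
    search: Callable[[str], List[Page]],
    rerank: Callable[[str, Sequence[Page]], int],
    summarize: Callable[[str, Page], str],
) -> str:
    """Chain the three tasks named in the abstract into one search round.

    All four callables are assumptions made for this sketch.
    """
    new_query = requery(query)                 # task 1: rewrite the user query
    candidates = search(new_query)             # retrieve candidate pages
    if not candidates:
        return ""                              # nothing retrieved, nothing to summarize
    best = rerank(query, candidates)           # task 2: pick the most useful page
    return summarize(query, candidates[best])  # task 3: answer from that page


# Toy stand-ins so the sketch runs without any model or network access.
CORPUS = [
    Page("Cats", "cats are small felines"),
    Page("MMSearch", "mmsearch contains 300 instances across 14 subfields"),
]


def toy_requery(query: str) -> str:
    # Normalize the query: trim, drop a trailing question mark, lowercase.
    return query.strip().rstrip("?").lower()


def toy_search(query: str) -> List[Page]:
    # Return pages sharing at least one word with the query.
    words = set(query.split())
    return [p for p in CORPUS if words & set(p.text.split())]


def toy_rerank(query: str, pages: Sequence[Page]) -> int:
    # Score pages by word overlap with the original query; return the best index.
    words = set(query.lower().rstrip("?").split())
    scores = [len(words & set(p.text.split())) for p in pages]
    return scores.index(max(scores))


def toy_summarize(query: str, page: Page) -> str:
    # A real summarizer would condense the page; the toy version returns it as is.
    return page.text


answer = run_search_round(
    "How many instances does MMSearch contain?",
    toy_requery, toy_search, toy_rerank, toy_summarize,
)
print(answer)
```

Passing the stages as callables keeps each one separately replaceable, which mirrors the benchmark's split between per-task evaluation and the end-to-end task.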
