MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
September 19, 2024
Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li
cs.AI
Abstract
The advent of Large Language Models (LLMs) has paved the way for AI search
engines, e.g., SearchGPT, showcasing a new paradigm in human-internet
interaction. However, most current AI search engines are limited to text-only
settings, neglecting the multimodal user queries and the text-image interleaved
nature of website information. Recently, Large Multimodal Models (LMMs) have
made impressive strides. Yet, whether they can function as AI search engines
remains under-explored, leaving the potential of LMMs in multimodal search an
open question. To this end, we first design a carefully constructed pipeline,
MMSearch-Engine, to equip any LMM with multimodal search capabilities. On
top of this, we introduce MMSearch, a comprehensive evaluation benchmark to
assess the multimodal search performance of LMMs. The curated dataset contains
300 manually collected instances spanning 14 subfields and has no overlap
with the training data of current LMMs, ensuring the correct answer can
only be obtained through searching. By using MMSearch-Engine, the LMMs are
evaluated by performing three individual tasks (requery, rerank, and
summarization), and one challenging end-to-end task with a complete searching
process. We conduct extensive experiments on closed-source and open-source
LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best
results, which surpasses the commercial product, Perplexity Pro, in the
end-to-end task, demonstrating the effectiveness of our proposed pipeline. We
further present an error analysis revealing that current LMMs still struggle to
fully grasp multimodal search tasks, and conduct an ablation study indicating the
potential of scaling test-time computation for AI search engines. We hope
MMSearch may provide unique insights to guide the future development of
multimodal AI search engines. Project Page: https://mmsearch.github.io
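
The abstract names three stages (requery, rerank, and summarization) plus an end-to-end task that chains them. As a rough illustration of how such a chain might be wired together, here is a minimal Python sketch. It is not the authors' MMSearch-Engine implementation: the `Page` type, the `answer_query` function, the `top_k` parameter, and all `toy_*` helpers are hypothetical names introduced here, the retrieval step is assumed to be an external search backend, and the example is text-only for brevity, whereas the benchmark also covers image inputs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """A retrieved web page (hypothetical structure, for illustration only)."""
    url: str
    text: str


def answer_query(
    query: str,
    requery: Callable[[str], str],
    search: Callable[[str], List[Page]],
    rerank: Callable[[str, List[Page]], List[Page]],
    summarize: Callable[[str, List[Page]], str],
    top_k: int = 1,
) -> str:
    """Chain the three stages named in the abstract into one end-to-end call."""
    search_query = requery(query)            # 1. requery: rewrite the user query
    candidates = search(search_query)        # retrieval (assumed external backend)
    ranked = rerank(query, candidates)       # 2. rerank: order pages by relevance
    return summarize(query, ranked[:top_k])  # 3. summarization: produce the answer


# Toy stand-ins so the sketch runs without a model or network access.
CORPUS = [
    Page("https://example.com/a", "cats sleep most of the day"),
    Page("https://example.com/b", "the benchmark has 300 instances"),
]


def toy_requery(query: str) -> str:
    return query.lower().rstrip("?")


def toy_search(search_query: str) -> List[Page]:
    return list(CORPUS)


def toy_rerank(query: str, pages: List[Page]) -> List[Page]:
    # Rank by the number of words shared between the query and the page text.
    words = set(query.lower().rstrip("?").split())
    return sorted(pages, key=lambda p: len(words & set(p.text.split())), reverse=True)


def toy_summarize(query: str, pages: List[Page]) -> str:
    return " ".join(p.text for p in pages)


result = answer_query(
    "How many instances does the benchmark have?",
    toy_requery, toy_search, toy_rerank, toy_summarize,
)
print(result)
```

Passing each stage as a callable keeps the stages separable, which loosely mirrors the abstract's split between three individually evaluated tasks and one end-to-end task. In a real system each callable would presumably wrap an LMM call.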