MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

Abstract

The advent of Large Language Models (LLMs) has paved the way for AI search engines such as SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting both multimodal user queries and the interleaved text-image nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides, yet whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a carefully constructed pipeline, MMSearch-Engine, to equip any LMM with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive benchmark for assessing the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields and has no overlap with the training data of current LMMs, ensuring that the correct answer can only be obtained by searching. Using MMSearch-Engine, the LMMs are evaluated on three individual tasks (requery, rerank, and summarization) and on one challenging end-to-end task with a complete search process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, surpassing the commercial product Perplexity Pro on the end-to-end task and demonstrating the effectiveness of the proposed pipeline. We further present an error analysis showing that current LMMs still struggle to fully grasp the multimodal search tasks, and conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSearch will provide unique insights to guide the future development of multimodal AI search engines. Project Page: https://mmsearch.github.io
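The abstract names three stages (requery, rerank, and summarization) plus an end-to-end run that chains them. The following is a minimal sketch of how such a chain could be wired together. It is not the MMSearch-Engine implementation: every name here (`Query`, `Page`, `run_search_pipeline`, the `toy_*` functions, the `example.com` URLs) is hypothetical, and the stage functions are simple word-overlap stand-ins for what would in practice be LMM calls and a real web search.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Query:
    text: str
    image_path: Optional[str] = None  # assumption: a multimodal query may carry an image


@dataclass
class Page:
    url: str
    content: str


def run_search_pipeline(
    query: Query,
    requery: Callable[[Query], str],
    search: Callable[[str], List[Page]],
    rerank: Callable[[Query, List[Page]], int],
    summarize: Callable[[Query, Page], str],
) -> str:
    """Chain the three stages into one end-to-end run (illustrative only)."""
    search_text = requery(query)           # stage 1: rewrite the query for a search engine
    candidates = search(search_text)       # retrieval step (hypothetical search backend)
    if not candidates:
        return ""                          # nothing retrieved, nothing to summarize
    best = rerank(query, candidates)       # stage 2: index of the most useful page
    return summarize(query, candidates[best])  # stage 3: answer from the chosen page


# Toy stand-ins so the sketch runs without any model or network access.
CORPUS = [
    Page("https://example.com/a", "MMSearch contains 300 instances. They span 14 subfields."),
    Page("https://example.com/b", "Perplexity Pro is a commercial product."),
]


def toy_requery(query: Query) -> str:
    return query.text.rstrip("?").lower()


def toy_search(search_text: str) -> List[Page]:
    words = set(search_text.split())
    return [p for p in CORPUS if words & set(p.content.lower().split())]


def toy_rerank(query: Query, pages: List[Page]) -> int:
    query_words = set(query.text.lower().split())

    def overlap(i: int) -> int:
        return len(query_words & set(pages[i].content.lower().split()))

    return max(range(len(pages)), key=overlap)


def toy_summarize(query: Query, page: Page) -> str:
    first = page.content.split(". ")[0]
    return first if first.endswith(".") else first + "."


answer = run_search_pipeline(
    Query("How many instances does MMSearch contain?"),
    toy_requery, toy_search, toy_rerank, toy_summarize,
)
print(answer)
```

Passing each stage in as a callable mirrors the evaluation setup described above: a single stage can be swapped out and scored on its own, or all three can be run together as the end-to-end task.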
