Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Abstract

Search engines enable the retrieval of unknown information through text queries. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question about that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical because of the heavy computational cost. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages the visual understanding capabilities of VLMs and the real-time information access of web agents to perform open-world retrieval-augmented generation (RAG) via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments on both open-set and closed-set QA benchmarks demonstrate that Vision Search Assistant significantly outperforms other models and can be widely applied to existing VLMs.
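
The collaboration described above can be pictured as a small pipeline: the VLM turns the image into text, a web agent retrieves pages for that text, and the VLM answers using the retrieved context. The following is only a minimal sketch under stated assumptions, not the authors' implementation. Every name in it (`Snippet`, `answer_with_web_search`, and the `describe`, `search`, and `generate` callables) is hypothetical, and the stand-in functions return canned strings instead of calling a real model or search API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snippet:
    """One retrieved web result (hypothetical structure)."""
    url: str
    text: str


def answer_with_web_search(
    image: bytes,
    question: str,
    describe: Callable[[bytes], str],
    search: Callable[[str], List[Snippet]],
    generate: Callable[[bytes, str, str], str],
    top_k: int = 3,
) -> str:
    """Sketch of retrieval-augmented answering for an unfamiliar image."""
    # 1. The VLM converts the visual content into a textual description.
    description = describe(image)
    # 2. The description and the user's question form the web query.
    query = f"{description} {question}".strip()
    # 3. The web agent retrieves up-to-date results; keep the first top_k.
    snippets = search(query)[:top_k]
    # 4. The VLM answers, conditioned on the image and the retrieved text.
    context = "\n".join(f"[{i + 1}] {s.text}" for i, s in enumerate(snippets))
    return generate(image, question, context)


# Stand-in components (assumptions) so the sketch runs without any model.
def fake_describe(image: bytes) -> str:
    return "a red handheld device"


def fake_search(query: str) -> List[Snippet]:
    return [Snippet("https://example.com/a", f"Result about {query}")]


def fake_generate(image: bytes, question: str, context: str) -> str:
    return f"Q: {question}\nContext:\n{context}"


answer = answer_with_web_search(
    b"", "What is this?", fake_describe, fake_search, fake_generate
)
print(answer)
```

Passing the three components as callables keeps the sketch independent of any particular VLM or search backend, which mirrors the abstract's claim that the framework can be applied to existing VLMs.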