ROSE: Retrieval-Oriented Segmentation Enhancement

Abstract

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities because they cannot incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize because they are absent from the training data, and (ii) emerging entities that exist within the model's knowledge but require up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. We also propose ROSE (Retrieval-Oriented Segmentation Enhancement), a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module uses the user-provided multimodal inputs to retrieve real-time web information. Second, a Textual Prompt Enhancer supplies the model with up-to-date information and rich background knowledge, improving its ability to perceive emerging entities. Third, a Visual Prompt Enhancer compensates for MLLMs' lack of exposure to novel entities by leveraging internet-sourced images. Finally, to maintain efficiency, a WebSense module decides, based on the user input, when to invoke the retrieval mechanisms. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 points in gIoU.
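
The control flow described above (a gate that decides whether to retrieve, followed by textual and visual prompt enhancement before the segmentation model is called) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (Retrieved, SegRequest, enhance_text, rose_prepare, toy_gate, toy_retrieve) is hypothetical, the gate and the retriever are passed in as plain callables, and the "enhancers" are reduced to simple string and list operations purely to show the order of the steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Retrieved:
    """Web evidence returned by a (hypothetical) retrieval step."""
    texts: List[str] = field(default_factory=list)
    image_refs: List[str] = field(default_factory=list)


@dataclass
class SegRequest:
    """Input handed to the underlying MLLM-based segmentation model."""
    image_ref: str
    prompt: str
    reference_images: List[str] = field(default_factory=list)


def enhance_text(query: str, evidence: Retrieved) -> str:
    """Stand-in for a textual prompt enhancer: prepend retrieved text."""
    if not evidence.texts:
        return query
    context = " ".join(evidence.texts)
    return f"Background: {context}\nQuery: {query}"


def rose_prepare(
    image_ref: str,
    query: str,
    needs_web: Callable[[str], bool],
    retrieve: Callable[[str, str], Retrieved],
) -> SegRequest:
    """Build the segmentation request, retrieving only when the gate fires."""
    if not needs_web(query):
        # Gate (WebSense stand-in) says no: pass the input through unchanged.
        return SegRequest(image_ref=image_ref, prompt=query)
    evidence = retrieve(image_ref, query)
    return SegRequest(
        image_ref=image_ref,
        prompt=enhance_text(query, evidence),
        # Stand-in for a visual prompt enhancer: attach retrieved images.
        reference_images=list(evidence.image_refs),
    )


def toy_gate(query: str) -> bool:
    """Toy assumption: only queries mentioning 'latest' need the web."""
    return "latest" in query.lower()


def toy_retrieve(image_ref: str, query: str) -> Retrieved:
    """Toy assumption: return fixed evidence instead of calling the web."""
    return Retrieved(texts=["Toy snippet."], image_refs=["ref_0.jpg"])


if __name__ == "__main__":
    print(rose_prepare("a.jpg", "Segment the dog", toy_gate, toy_retrieve))
    print(rose_prepare("a.jpg", "Segment the latest phone", toy_gate, toy_retrieve))
```

Passing the gate and retriever as callables mirrors the plug-and-play framing: in this sketch the segmentation model itself is never touched, only the request it receives.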