ROSE: Retrieval-Oriented Segmentation Enhancement

Abstract

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities because they cannot incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize because they are absent from the training data, and (ii) emerging entities that exist within the model's knowledge but require up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. We also propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module uses the user-provided multimodal inputs to retrieve real-time web information. Second, a Textual Prompt Enhancer supplies the model with up-to-date information and rich background knowledge, improving its perception of emerging entities. Third, a Visual Prompt Enhancer compensates for MLLMs' lack of exposure to novel entities by leveraging internet-sourced images. Finally, to maintain efficiency, a WebSense module decides when to invoke the retrieval mechanisms based on the user input. Experimental results show that ROSE significantly improves performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 gIoU points.
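
The control flow implied by the four components can be sketched as below. This is only an illustrative sketch, not the authors' implementation: every name in it (`Retrieved`, `SegRequest`, `web_sense`, `enhance_text`, `rose_segment`, the stub functions) is hypothetical, the keyword-overlap gate is a toy stand-in for the WebSense module, and the retrieval and segmentation backends are passed in as plain callables so that any MLLM-based segmenter could in principle be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Retrieved:
    """Hypothetical container for what a web retrieval step returns."""
    texts: List[str] = field(default_factory=list)
    image_refs: List[str] = field(default_factory=list)


@dataclass
class SegRequest:
    """Hypothetical input to an MLLM-based segmentation model."""
    image_ref: str
    prompt: str
    reference_images: List[str] = field(default_factory=list)


def web_sense(query: str, known_entities: Set[str]) -> bool:
    """Toy stand-in for WebSense (assumption): retrieve only when the
    query mentions no entity from a known-entity list."""
    words = {w.strip(".,?!").lower() for w in query.split()}
    return not (words & known_entities)


def enhance_text(prompt: str, retrieved: Retrieved) -> str:
    """Toy textual prompt enhancer: prepend retrieved background text."""
    if not retrieved.texts:
        return prompt
    return "Context: " + " ".join(retrieved.texts) + "\nQuery: " + prompt


def rose_segment(
    image_ref: str,
    query: str,
    segment: Callable[[SegRequest], dict],
    retrieve: Callable[[str, str], Retrieved],
    known_entities: Set[str],
) -> dict:
    """Gate retrieval, enrich the text and visual prompts, then segment."""
    request = SegRequest(image_ref=image_ref, prompt=query)
    if web_sense(query, known_entities):
        retrieved = retrieve(image_ref, query)                 # web retrieval
        request.prompt = enhance_text(query, retrieved)        # textual prompt
        request.reference_images = list(retrieved.image_refs)  # visual prompt
    return segment(request)


# Stub backends for illustration only; "Zorblax" is a made-up entity.
def fake_retrieve(image_ref: str, query: str) -> Retrieved:
    return Retrieved(
        texts=["Zorblax is a phone announced recently."],
        image_refs=["ref_0.jpg"],
    )


def fake_segment(request: SegRequest) -> dict:
    return {"prompt": request.prompt, "n_refs": len(request.reference_images)}
```

With these stubs, `rose_segment("img.jpg", "Segment the dog", fake_segment, fake_retrieve, {"dog"})` skips retrieval and forwards the query unchanged, while a query about the unknown "Zorblax phone" triggers retrieval and reaches the segmenter with an enriched prompt and one reference image. Keeping the gate in front of retrieval reflects the efficiency motivation stated in the abstract: web access is paid for only when the input appears to need it.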