InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Abstract

Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insight into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning on our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
