InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Abstract

Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
