InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Abstract

Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
