SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Abstract

While multimodal large language models (MLLMs) have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, which limits their practical applications. Current evaluation tasks such as visual question answering (VQA) and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Segmentation is foundational for pixel-level understanding, but existing methods often require MLLMs to generate implicit tokens that are decoded by external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, and it fails to reflect the model's intrinsic pixel-level understanding. We therefore introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm in which MLLMs mimic human annotators using interactive segmentation tools. By modeling segmentation as a multi-step Markov Decision Process, HLMAT lets an MLLM iteratively generate text-based click points and reach high-quality masks without architectural changes or implicit tokens. With this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks such as mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of the policy improvement method StaR and of process reward model (PRM)-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advances in fine-grained visual perception and multi-step decision-making for MLLMs.
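As a rough illustration of the annotation loop that the abstract describes (state: the current mask; action: a text-encoded click; transition: an interactive segmentation tool updating the mask), the following toy sketch may help. It is not SegAgent's implementation. Every name in it (`block_of`, `toy_segmenter`, `scripted_policy`, `annotate`, the `positive (x, y)` action format) is a hypothetical chosen for illustration. The segmentation tool is replaced by a trivial rule that adds or removes a 2x2 block of pixels, and the MLLM is replaced by a scripted policy that peeks at the target mask, which a real model would not see.

```python
import re
from typing import List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]
BLOCK = 2  # side length of the toy "superpixel" a single click selects


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def block_of(x: int, y: int) -> Mask:
    """Pixels of the BLOCK x BLOCK cell containing (x, y)."""
    bx, by = (x // BLOCK) * BLOCK, (y // BLOCK) * BLOCK
    return {(bx + dx, by + dy) for dx in range(BLOCK) for dy in range(BLOCK)}


def toy_segmenter(mask: Mask, x: int, y: int, positive: bool) -> Mask:
    """Hypothetical stand-in for an interactive segmentation tool:
    a positive click adds the clicked block, a negative click removes it."""
    cell = block_of(x, y)
    return mask | cell if positive else mask - cell


def scripted_policy(mask: Mask, target: Mask) -> Optional[str]:
    """Stand-in for the MLLM policy. It emits the next click as plain text,
    or None to stop. Unlike a real model, it reads the target mask directly."""
    missing = sorted(target - mask)
    extra = sorted(mask - target)
    if missing:
        return "positive ({}, {})".format(*missing[0])
    if extra:
        return "negative ({}, {})".format(*extra[0])
    return None


ACTION_RE = re.compile(r"(positive|negative) \((-?\d+), (-?\d+)\)")


def parse_action(text: str) -> Tuple[bool, int, int]:
    """Parse the text action back into (is_positive, x, y)."""
    match = ACTION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unparseable action: {text!r}")
    return match.group(1) == "positive", int(match.group(2)), int(match.group(3))


def annotate(
    target: Mask, initial: Optional[Mask] = None, max_steps: int = 20
) -> Tuple[Mask, List[Tuple[str, float]]]:
    """Run the multi-step loop: observe mask, emit text click, update mask.
    Returns the final mask and the trajectory of (action, IoU after action)."""
    mask: Mask = set(initial) if initial else set()
    trajectory: List[Tuple[str, float]] = []
    for _ in range(max_steps):
        action = scripted_policy(mask, target)
        if action is None:  # policy decides the mask is finished
            break
        positive, x, y = parse_action(action)
        mask = toy_segmenter(mask, x, y, positive)
        trajectory.append((action, iou(mask, target)))
    return mask, trajectory
```

Calling `annotate(target)` from an empty mask mimics annotation from scratch, while passing a noisy `initial` mask mimics the mask refinement use case mentioned in the abstract: the same loop then issues negative clicks to remove the excess region. In this toy setting the target is assumed to be a union of whole blocks, so the loop terminates with an exact match; a real segmentation tool offers no such guarantee, which is why the sketch also caps the number of steps.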
