Referring to Any Person

Abstract

Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and that current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks such as RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek