Detect Anything via Next Point Prediction
October 14, 2025
Authors: Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang
cs.AI
Abstract
Object detection has long been dominated by traditional coordinate
regression-based models, such as YOLO, DETR, and Grounding DINO. Although
recent efforts have attempted to leverage MLLMs to tackle this task, they face
challenges such as low recall, duplicate predictions, and coordinate
misalignment. In this work, we bridge this gap and propose Rex-Omni, a
3B-scale MLLM that achieves state-of-the-art object perception performance. On
benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or
exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot
setting. This is enabled by three key designs: 1) Task Formulation: we use
special tokens to represent quantized coordinates from 0 to 999, reducing the
model's learning difficulty and improving token efficiency for coordinate
prediction; 2) Data Engines: we construct multiple data engines to generate
high-quality grounding, referring, and pointing data, providing semantically
rich supervision for training; 3) Training Pipelines: we employ a two-stage
training process, combining supervised fine-tuning on 22 million samples with
GRPO-based reinforcement learning (RL) post-training. This RL post-training leverages
geometry-aware rewards to effectively bridge the discrete-to-continuous
coordinate prediction gap, improve box accuracy, and mitigate undesirable
behaviors like duplicate predictions that stem from the teacher-guided nature
of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent
language understanding enables versatile capabilities such as object referring,
pointing, visual prompting, GUI grounding, spatial referring, OCR, and
key-pointing, all systematically evaluated on dedicated benchmarks. We believe
that Rex-Omni paves the way for more versatile and language-aware visual
perception systems.
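
The first design point, quantized coordinates represented as special tokens, is simple enough to sketch. Below is a minimal Python illustration assuming a 1000-bin scheme in which every box costs exactly four tokens; the `<78>`-style token format and the helper names are hypothetical conventions for illustration, not Rex-Omni's actual vocabulary.

```python
# Minimal sketch of 0-999 coordinate quantization as special tokens.
# Token format "<78>" and helper names are illustrative assumptions,
# not the model's actual vocabulary.

NUM_BINS = 1000  # quantized coordinate range [0, 999]

def quantize(coord: float, image_size: int) -> int:
    """Map an absolute pixel coordinate to a discrete bin in [0, 999]."""
    bin_id = int(coord / image_size * NUM_BINS)
    return min(max(bin_id, 0), NUM_BINS - 1)

def box_to_tokens(box, width, height):
    """Encode an (x0, y0, x1, y1) pixel box as four special tokens,
    so a box costs a fixed four tokens instead of a long digit string."""
    x0, y0, x1, y1 = box
    bins = [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]
    return [f"<{b}>" for b in bins]

def tokens_to_box(tokens, width, height):
    """Decode special tokens back to pixel coordinates (bin centers)."""
    bins = [int(t.strip("<>")) for t in tokens]
    scale = [width, height, width, height]
    return [(b + 0.5) / NUM_BINS * s for b, s in zip(bins, scale)]

if __name__ == "__main__":
    box = (100.0, 200.0, 640.0, 480.0)
    toks = box_to_tokens(box, width=1280, height=960)
    print(toks)                          # ['<78>', '<208>', '<500>', '<500>']
    print(tokens_to_box(toks, 1280, 960))
```

Quantizing to a fixed 0-999 range keeps the coordinate vocabulary small and the per-box token count constant, which is presumably what the abstract means by reduced learning difficulty and improved token efficiency.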
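The geometry-aware reward driving the GRPO post-training can be illustrated in the same spirit. The abstract does not specify the exact reward, so the sketch below is an assumption: a hypothetical `geometry_reward` that matches predictions to ground truth one-to-one, grants continuous IoU credit for tight boxes, and penalizes unmatched outputs, one plausible way such a reward could both bridge the discrete-to-continuous coordinate gap and suppress the duplicate predictions left over from SFT.

```python
# Hedged sketch of a geometry-aware reward of the kind the abstract
# attributes to GRPO post-training. The actual reward used by Rex-Omni
# is not given here; this shows one plausible form.

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def geometry_reward(pred_boxes, gt_boxes, match_thresh=0.5):
    """Greedy one-to-one matching: each ground-truth box can reward at
    most one prediction, so duplicate predictions earn no extra credit
    and unmatched (spurious) predictions are penalized."""
    unmatched_gt = list(gt_boxes)
    reward = 0.0
    for p in pred_boxes:
        best = max(unmatched_gt, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) > match_thresh:
            reward += iou(p, best)   # continuous credit for tight boxes
            unmatched_gt.remove(best)
        else:
            reward -= 0.5            # duplicate or false positive
    return reward / max(len(gt_boxes), 1)
```

Because IoU varies continuously with box coordinates, a reward of this shape gives the policy gradient signal between adjacent quantization bins, whereas pure token-level SFT loss cannot distinguish a near-miss from a gross error.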