

Detect Anything via Next Point Prediction

October 14, 2025
Authors: Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang
cs.AI

Abstract

Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage multimodal large language models (MLLMs) to tackle this task, they face challenges like low recall rates, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
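The task formulation above represents quantized coordinates from 0 to 999 with special tokens. The sketch below illustrates one way such a quantization could work; the normalization, the clamping, and the `<|coord_k|>` token naming are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): quantize box coordinates into
# 1,000 discrete bins (0-999) so each coordinate maps to a single special token.

def quantize_coord(value: float, extent: float, num_bins: int = 1000) -> int:
    """Map an absolute pixel coordinate to a bin index in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)  # normalize and clamp to [0, 1]
    return min(int(ratio * num_bins), num_bins - 1)

def box_to_tokens(box, width, height):
    """Convert an (x0, y0, x1, y1) pixel box into hypothetical coordinate tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize_coord(x0, width),
        quantize_coord(y0, height),
        quantize_coord(x1, width),
        quantize_coord(y1, height),
    ]
    # "<|coord_k|>" is an assumed token naming scheme, used here only for illustration.
    return [f"<|coord_{b}|>" for b in bins]

# Example: a 640x480 image with a box covering its left half.
print(box_to_tokens((0, 0, 320, 480), width=640, height=480))
# -> ['<|coord_0|>', '<|coord_0|>', '<|coord_500|>', '<|coord_999|>']
```

Under this scheme each box costs only four coordinate tokens, which is the token-efficiency argument the abstract makes for the formulation.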
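The abstract also credits GRPO-based reinforcement post-training with geometry-aware rewards for closing the discrete-to-continuous coordinate gap and suppressing duplicate predictions. The sketch below shows one plausible reward of that kind, built on box IoU; the greedy matching and the duplicate penalty are simplifying assumptions, not the paper's exact reward design.

```python
# Minimal sketch: an IoU-based geometry-aware reward, one plausible
# instantiation of the reward family the abstract describes.

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def geometry_reward(pred_boxes, gt_boxes) -> float:
    """Average best-match IoU over ground-truth boxes; normalizing by the
    larger of the two set sizes penalizes duplicate or spurious predictions.
    Greedy matching is an illustrative simplification."""
    if not gt_boxes:
        return 1.0 if not pred_boxes else 0.0
    scores = [max((iou(p, gt) for p in pred_boxes), default=0.0) for gt in gt_boxes]
    return sum(scores) / max(len(gt_boxes), len(pred_boxes))
```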