
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

June 4, 2025
Authors: Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang
cs.AI

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit Chain-of-Thought (CoT) reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
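To make the candidate-then-verify paradigm concrete, here is a minimal, self-contained Python sketch of the inference loop the abstract describes. Everything in it is hypothetical: `Candidate`, `reason_over_candidate`, and the keyword-based "reasoning" are toy stand-ins for the model's multimodal CoT over image evidence, not the authors' implementation. It illustrates the two grounding properties: each decision carries a planning/action/summarization trace (verifiable), and an empty result is returned when no candidate matches (trustworthy).

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One detected instance of the referred object category."""
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2)
    attributes: set[str]            # toy stand-in for visual evidence

def reason_over_candidate(cand: Candidate, expression: str) -> tuple[bool, str]:
    """Toy per-candidate CoT: plan what to check, act by inspecting the
    candidate's attributes, then summarize with a match decision."""
    plan = f"Planning: check whether the candidate satisfies '{expression}'."
    wanted = set(expression.lower().split())
    found = wanted & cand.attributes
    action = f"Action: candidate at {cand.box} shows attributes {sorted(found)}."
    is_match = wanted <= cand.attributes
    summary = f"Summarization: {'match' if is_match else 'no match'}."
    return is_match, "\n".join([plan, action, summary])

def rex_thinker_refer(candidates: list[Candidate], expression: str):
    """Reason over every candidate; abstain (empty list) if none match."""
    boxes, traces = [], []
    for cand in candidates:
        ok, trace = reason_over_candidate(cand, expression)
        traces.append(trace)
        if ok:
            boxes.append(cand.box)
    return boxes, traces

if __name__ == "__main__":
    people = [
        Candidate((10, 20, 50, 120), {"red", "shirt"}),
        Candidate((60, 25, 95, 118), {"blue", "shirt"}),
    ]
    boxes, traces = rex_thinker_refer(people, "red shirt")
    print(boxes)      # [(10, 20, 50, 120)]
    print(traces[1])  # trace explaining why the second candidate is rejected
```

The key design point the sketch mirrors is that rejection falls out of the loop for free: if every per-candidate check fails, the model returns no boxes rather than being forced to emit one, which is how the paper frames trustworthy abstention.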