
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

June 4, 2025
Authors: Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang
cs.AI

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit Chain-of-Thought (CoT) reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
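To make the candidate-then-verify paradigm concrete, here is a minimal, self-contained Python sketch of the inference loop the abstract describes. Everything in it is hypothetical: `Candidate`, `reason_over_candidate`, and the keyword-based "reasoning" are toy stand-ins for the model's multimodal CoT over image evidence, not the authors' implementation. It illustrates the two grounding properties: each decision carries a planning/action/summarization trace (verifiable), and an empty result is returned when no candidate matches (trustworthy).

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One detected instance of the referred object category."""
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2)
    attributes: set[str]            # toy stand-in for visual evidence

def reason_over_candidate(cand: Candidate, expression: str) -> tuple[bool, str]:
    """Toy per-candidate CoT: plan what to check, act by inspecting the
    candidate's attributes, then summarize with a match decision."""
    plan = f"Planning: check whether the candidate satisfies '{expression}'."
    wanted = set(expression.lower().split())
    found = wanted & cand.attributes
    action = f"Action: candidate at {cand.box} shows attributes {sorted(found)}."
    is_match = wanted <= cand.attributes
    summary = f"Summarization: {'match' if is_match else 'no match'}."
    return is_match, "\n".join([plan, action, summary])

def rex_thinker_refer(candidates: list[Candidate], expression: str):
    """Reason over every candidate; abstain (empty list) if none match."""
    boxes, traces = [], []
    for cand in candidates:
        ok, trace = reason_over_candidate(cand, expression)
        traces.append(trace)
        if ok:
            boxes.append(cand.box)
    return boxes, traces

if __name__ == "__main__":
    people = [
        Candidate((10, 20, 50, 120), {"red", "shirt"}),
        Candidate((60, 25, 95, 118), {"blue", "shirt"}),
    ]
    boxes, traces = rex_thinker_refer(people, "red shirt")
    print(boxes)      # [(10, 20, 50, 120)]
    print(traces[1])  # trace explaining why the second candidate is rejected
```

The key design point the sketch mirrors is that rejection falls out of the loop for free: if every per-candidate check fails, the model returns no boxes rather than being forced to emit one, which is how the paper frames trustworthy abstention.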