
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

June 4, 2025
Authors: Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang
cs.AI

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit Chain-of-Thought (CoT) reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
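The abstract outlines a candidate-then-verify pipeline: detect all instances of the referred category, reason over each candidate against the expression, and abstain when nothing matches. The Python sketch below is only an illustration of that control flow under our own assumptions; the helper names `detect_candidates` and `judge_candidate`, the `CandidateJudgment` structure, and the planning/action/summarization comments are hypothetical and do not reflect the paper's released code or interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class CandidateJudgment:
    """Per-candidate verdict: free-form reasoning plus a match decision."""
    box: Box
    reasoning: str   # chain-of-thought linking this box to visual evidence
    matches: bool


def refer(
    image,
    expression: str,
    category: str,
    detect_candidates: Callable[[object, str], List[Box]],
    judge_candidate: Callable[[object, str, Box], CandidateJudgment],
) -> List[Box]:
    """Candidate-based CoT referring (illustrative only).

    Planning:      enumerate every instance of the referred object category.
    Action:        reason over each candidate and decide whether it matches.
    Summarization: return the matched boxes; an empty list means abstention,
                   i.e. no object in the image satisfies the expression.
    """
    candidates = detect_candidates(image, category)
    judgments = [judge_candidate(image, expression, box) for box in candidates]
    return [j.box for j in judgments if j.matches]
```

Structuring the output as per-candidate judgments is what makes the prediction verifiable (each kept box carries its own reasoning) and trustworthy (returning an empty list is a legitimate outcome rather than a forced guess).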