Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded: its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, which offers limited interpretability and makes it hard to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit chain-of-thought (CoT) reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase that teaches the model to perform structured reasoning, followed by a GRPO-based reinforcement learning phase that improves accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
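
The candidate-wise flow described in the abstract (propose candidates, reason over each one, then predict or abstain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Box`, `Verdict`, `refer`, and `left_half` are hypothetical, the box format is assumed to be `(x0, y0, x1, y1)`, and the `judge` callable is a stand-in for the model's per-candidate reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels. Hypothetical, for illustration only.
Box = Tuple[float, float, float, float]


@dataclass
class Verdict:
    box: Box
    matches: bool
    rationale: str  # justification tied to this specific candidate


def refer(
    candidates: List[Box],
    judge: Callable[[Box], Tuple[bool, str]],
) -> Tuple[List[Box], List[Verdict]]:
    """Judge every candidate, keep the matches, and return the full trace.

    An empty prediction list means the sketch abstains: no candidate
    satisfied the expression.
    """
    trace: List[Verdict] = []
    for box in candidates:
        ok, why = judge(box)
        trace.append(Verdict(box, ok, why))
    predictions = [v.box for v in trace if v.matches]
    return predictions, trace


# Toy stand-in for the reasoning step: "the person on the left",
# decided by whether the box center lies left of x = 50.
def left_half(box: Box) -> Tuple[bool, str]:
    x0, _, x1, _ = box
    center = (x0 + x1) / 2
    ok = center < 50
    return ok, f"center x = {center:.0f}, {'left' if ok else 'not left'} of x = 50"


boxes = [(10, 20, 30, 80), (60, 20, 90, 80)]
preds, trace = refer(boxes, left_half)

# A judge that rejects everything yields an empty prediction list (abstention).
none_preds, _ = refer(boxes, lambda b: (False, "no candidate fits the expression"))
```

Keeping the per-candidate trace alongside the final boxes mirrors the two properties named in the abstract: the rationale makes each decision inspectable, and an empty prediction list is the abstention case.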
