Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

Abstract

Large language models (LLMs) have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, which hinders accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization (GRPO). Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
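To make the reward design concrete, the following is a minimal sketch of how an answer-correctness reward and a retrieval-specific reward might be combined and then turned into group-relative advantages, which is the core normalization step in group relative policy optimization. The abstract does not specify the reward formulas, so the functions `exact_match` and `retrieval_reward`, the `retrieval_weight` parameter, and the use of the population standard deviation are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def exact_match(prediction: str, gold: str) -> float:
    """Answer-correctness reward: 1.0 if the normalized strings match, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def retrieval_reward(refined_notes: str, gold: str) -> float:
    """Hypothetical retrieval-specific reward (an assumption, not from the paper):
    1.0 if the gold answer appears in the refined evidence, else 0.0."""
    return float(gold.strip().lower() in refined_notes.lower())


def combined_reward(prediction: str, refined_notes: str, gold: str,
                    retrieval_weight: float = 0.5) -> float:
    """Sum the two rewards; the weighting scheme is illustrative only."""
    return exact_match(prediction, gold) + retrieval_weight * retrieval_reward(refined_notes, gold)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its group: (r - mean) / (std + eps).
    Population std is used here; implementations differ on this detail."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of rollouts for one question: (final answer, refined evidence).
gold_answer = "Paris"
rollouts = [
    ("Paris", "The capital of France is Paris."),
    ("Lyon", "The capital of France is Paris."),
    ("Lyon", "Lyon is a large city in France."),
]
rewards = [combined_reward(ans, notes, gold_answer) for ans, notes in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch, a rollout that refines the right evidence but answers incorrectly still earns partial reward, so it ranks above a rollout that never surfaced the answer. That ordering is the intuition behind rewarding retrieval quality separately from answer correctness.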

