AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Abstract

Search-augmented large language models (LLMs) often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single LLM to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts to generate answers. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final-answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model while using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9× more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at https://github.com/ritaranx/AceSearcher and https://huggingface.co/AceSearcher.
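The decomposer/solver loop and the final-answer reward described in the abstract can be illustrated with a minimal, hypothetical sketch. It is not the released AceSearcher implementation: `ToyModel`, `toy_retrieve`, `answer_with_decomposition`, and `rollout_reward` are illustrative names; the lookup tables stand in for one LLM prompted in two roles; the word-overlap retriever stands in for a real search engine; and the SQuAD-style answer normalization used for exact match is an assumption about how the metric is computed.

```python
import re
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def toy_retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Hypothetical retriever: rank passages by word overlap with the query."""
    query_words = set(normalize(query).split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(normalize(passage).split())),
        reverse=True,
    )
    return ranked[:k]


class ToyModel:
    """Hypothetical stand-in for ONE model used in two roles via lookup tables."""

    def __init__(self, decompositions: Dict[str, List[str]], answers: Dict[str, str]):
        self.decompositions = decompositions
        self.answers = answers

    def decompose(self, question: str) -> List[str]:
        # Decomposer role: return sub-question templates; "{0}", "{1}", ... refer
        # to answers of earlier sub-questions. Unknown questions stay undecomposed.
        return self.decompositions.get(question, [question])

    def solve(self, question: str, context: List[str]) -> str:
        # Solver role: answer only when a retrieved passage supports the answer.
        answer = self.answers.get(question, "unknown")
        return answer if any(answer in passage for passage in context) else "unknown"


def answer_with_decomposition(model: ToyModel, question: str, corpus: List[str], k: int = 1) -> str:
    """Decompose the question, then retrieve and solve each sub-question in order."""
    sub_answers: List[str] = []
    for template in model.decompose(question):
        sub_question = template.format(*sub_answers)  # fill in earlier answers
        context = toy_retrieve(sub_question, corpus, k)
        sub_answers.append(model.solve(sub_question, context))
    return sub_answers[-1] if sub_answers else "unknown"


def rollout_reward(model: ToyModel, question: str, gold: str, corpus: List[str]) -> float:
    """Reward that scores only the final answer, never the intermediate steps."""
    return exact_match(answer_with_decomposition(model, question, corpus), gold)


if __name__ == "__main__":
    corpus = [
        "Book X was written by Alice Smith.",
        "Alice Smith was born in Freedonia.",
        "Book Y was written by Bob Jones.",
    ]
    question = "In which country was the author of Book X born?"
    model = ToyModel(
        decompositions={question: ["Who wrote Book X?", "In which country was {0} born?"]},
        answers={
            "Who wrote Book X?": "Alice Smith",
            "In which country was Alice Smith born?": "Freedonia",
        },
    )
    print(answer_with_decomposition(model, question, corpus))  # Freedonia
    print(rollout_reward(model, question, "freedonia", corpus))  # 1.0
```

Because `rollout_reward` looks only at the final answer, a reinforcement fine-tuning step built on a signal like this would need no labels for the intermediate sub-questions, which mirrors, in toy form, the abstract's point about eliminating intermediate annotations.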
