FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
May 22, 2025
作者: Chaeeun Kim, Seungone Kim
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in
multi-step reasoning and calling search engines at appropriate steps. However,
existing retrieval-augmented reasoning approaches rely on separate retrieval
models, limiting the LRM's role in retrieval to deciding when to retrieve and
how to query. This separation not only increases hardware and operational costs
but also leads to errors in the retrieval process due to the representation
bottleneck, a phenomenon where the retriever's embedding space is not
expressive enough to meet the generator's requirements. To address this, we
shift our perspective from sequence-to-sequence matching to locating the
answer-containing paths within the corpus, and propose a novel framework called
FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables
LRMs to retrieve relevant knowledge on their own by acting as both a generator
and retriever. To achieve this, we introduce a variant of the MCTS algorithm
specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing
Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus
toward answer-containing regions. Our results on five open-domain QA
benchmarks, including single-hop and multi-hop questions, show that FREESON
achieves an average improvement of 14.4% in EM and F1 over four multi-step
reasoning models with a separate retriever, and it also performs comparably to
the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
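To make the corpus-traversal idea concrete, below is a minimal, illustrative sketch of an MCTS loop whose nodes are corpus passages rather than generated tokens, so the search walks the corpus toward passages that look answer-bearing. Everything concrete here is an assumption made for illustration: the toy CORPUS with hard-coded neighbor links, the lexical-overlap lrm_score stand-in for an LRM judge, and the Node/ct_mcts helper names are not taken from the paper and do not reproduce its actual CT-MCTS components.

```python
# Illustrative sketch only: an MCTS whose states are corpus passages, so the
# search "traverses" the corpus toward regions likely to contain the answer.
# The corpus, neighbor structure, and scoring function are placeholders.

import math
import random

# Toy corpus: passage id -> (text, ids of "neighboring" passages).
# In a real system the neighborhood would come from the corpus itself;
# here it is hard-coded purely for illustration.
CORPUS = {
    "p0": ("Paris is the capital of France.", ["p1", "p2"]),
    "p1": ("France is a country in Western Europe.", ["p0", "p3"]),
    "p2": ("The Eiffel Tower is located in Paris.", ["p0"]),
    "p3": ("Europe has many countries.", ["p1"]),
}

def lrm_score(question: str, passage: str) -> float:
    """Placeholder for an LRM judging how likely the passage region
    contains the answer; here a crude lexical-overlap proxy."""
    q_tokens = set(question.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / max(len(q_tokens), 1)

class Node:
    def __init__(self, pid, parent=None):
        self.pid = pid
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def uct(self, c=1.4):
        # Standard UCT: exploitation term plus exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def ct_mcts(question, root_pid, iterations=50):
    root = Node(root_pid)
    for _ in range(iterations):
        # Selection: descend by UCT while the node's neighbors are all expanded.
        node = root
        while node.children and all(
            any(c.pid == n for c in node.children)
            for n in CORPUS[node.pid][1]
        ):
            node = max(node.children, key=Node.uct)
        # Expansion: add one not-yet-expanded neighboring passage, if any.
        expanded = [c.pid for c in node.children]
        frontier = [n for n in CORPUS[node.pid][1] if n not in expanded]
        if frontier:
            node = Node(random.choice(frontier), parent=node)
            node.parent.children.append(node)
        # Simulation: score the reached passage with the placeholder judge.
        reward = lrm_score(question, CORPUS[node.pid][0])
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda c: c.visits) if root.children else root
    return best.pid, CORPUS[best.pid][0]

if __name__ == "__main__":
    print(ct_mcts("What is the capital of France?", root_pid="p1"))
```

The only departure from vanilla MCTS in this sketch is that expansion enumerates neighboring passages in the corpus instead of sampling model-generated actions, and the reward comes from a model-based judgment of the passage; how the paper's CT-MCTS actually defines the neighborhood and reward is specified in the paper itself, not here.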