FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
May 22, 2025
作者: Chaeeun Kim, Seungone Kim
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in
multi-step reasoning and calling search engines at appropriate steps. However,
existing retrieval-augmented reasoning approaches rely on separate retrieval
models, limiting the LRM's role in retrieval to deciding when to retrieve and
how to query. This separation not only increases hardware and operational costs
but also leads to errors in the retrieval process due to the representation
bottleneck, a phenomenon where the retriever's embedding space is not
expressive enough to meet the generator's requirements. To address this, we
shift our perspective from sequence-to-sequence matching to locating the
answer-containing paths within the corpus, and propose a novel framework called
FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables
LRMs to retrieve relevant knowledge on their own by acting as both a generator
and retriever. To achieve this, we introduce a variant of the MCTS algorithm
specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing
Monte Carlo Tree Search). In this algorithm, LRMs traverse the corpus
toward answer-containing regions. Our results on five open-domain QA
benchmarks, including single-hop and multi-hop questions, show that FREESON
achieves an average improvement of 14.4% in EM and F1 over four multi-step
reasoning models with a separate retriever, and it also performs comparably to
the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
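
To make the corpus-traversing idea concrete, below is a minimal sketch of an MCTS-style search over a passage-link graph. This is not the paper's CT-MCTS implementation: the `neighbors` and `score` callbacks and the toy graph are hypothetical stand-ins introduced only to illustrate the selection/expansion/evaluation/backpropagation loop; in FREESON the LRM itself would both propose traversal steps and judge passage relevance.

```python
# Minimal sketch of MCTS-style traversal over a corpus graph (illustrative only).
import math
import random


class Node:
    def __init__(self, passage, parent=None):
        self.passage = passage
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0


def uct(child, parent_visits, c=1.4):
    # Upper-confidence bound used during selection; unvisited children go first.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)


def corpus_mcts(question, start_passage, neighbors, score, iterations=200):
    root = Node(start_passage)
    best_passage, best_reward = start_passage, float("-inf")
    for _ in range(iterations):
        # 1) Selection: walk down the tree by maximizing UCT.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2) Expansion: linked / nearby passages become child nodes.
        for p in neighbors(node.passage):
            node.children.append(Node(p, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3) Evaluation: reward for how promising this passage looks
        #    (a stand-in here; in FREESON this judgment comes from the LRM).
        reward = score(question, node.passage)
        if reward > best_reward:
            best_passage, best_reward = node.passage, reward
        # 4) Backpropagation: push the reward back up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return best_passage


# Toy usage with a hypothetical passage-link graph and a keyword-style scorer.
links = {
    "Intro": ["P1", "P2"],
    "P1": ["P3"],
    "P2": ["P4"],
    "P3": [],
    "P4": [],
}
found = corpus_mcts(
    "which passage contains the answer?",
    "Intro",
    neighbors=lambda p: links.get(p, []),
    score=lambda q, p: 1.0 if p == "P3" else 0.1,  # pretend P3 holds the answer
)
print(found)  # -> "P3" once the search has reached that region of the graph
```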