A^2Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Abstract

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A^2Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed AnsF1 reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A^2Search achieves new state-of-the-art performance. With only a single rollout, A^2Search-7B yields an average AnsF1@1 score of 48.4% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B (46.2%). Extensive analyses further show that A^2Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
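
The abstract names an AnsF1 reward that accommodates multiple answers but does not define it. As a rough illustration only, the sketch below shows one plausible reading: an F1 score between the set of answers a model returns and a set of accepted reference answers. The function names `normalize` and `answer_set_f1`, the exact-match comparison, and the normalization rule are all assumptions made for this example, not the authors' implementation.

```python
from typing import Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (a simplifying assumption)."""
    return " ".join(answer.lower().split())


def answer_set_f1(predicted: Iterable[str], references: Iterable[str]) -> float:
    """F1 between predicted answers and reference answers, compared as sets.

    Hypothetical stand-in for a multi-answer reward; the paper's AnsF1
    may match and weight answers differently.
    """
    pred = {normalize(a) for a in predicted if a.strip()}
    gold = {normalize(a) for a in references if a.strip()}
    if not pred or not gold:
        return 0.0

    hits = len(pred & gold)  # answers that exactly match after normalization
    if hits == 0:
        return 0.0

    precision = hits / len(pred)  # penalizes returning wrong extra answers
    recall = hits / len(gold)     # penalizes missing valid alternatives
    return 2 * precision * recall / (precision + recall)
```

Under this reading, a model that returns one correct answer when two are accepted gets full precision but half recall, while a model that pads its output with wrong guesses loses precision. A single-answer exact-match score cannot express either distinction.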
