A^2Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Abstract

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks typically assume a single gold answer, overlook this reality, and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A^2Search, an annotation-free, end-to-end training framework that recognizes and handles ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed AnsF1 reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A^2Search achieves new state-of-the-art performance. With only a single rollout, A^2Search-7B yields an average AnsF1@1 score of 48.4% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B (46.2%). Extensive analyses further show that A^2Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights are available at https://github.com/zfj1998/A2Search
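The abstract does not spell out how AnsF1 is computed. As a rough illustration of why an answer-level F1 can accommodate multiple valid answers, the sketch below scores a set of predicted answers against a set of reference answers. The function names `normalize` and `answer_f1`, the normalization rules, and the exact-match criterion are all assumptions made for this example, not the authors' implementation.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization, similar in spirit to common QA evaluation scripts.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_f1(predicted: list[str], references: list[str]) -> float:
    """Set-level F1 between predicted answers and reference answers.

    Hypothetical helper: precision is the share of predicted answers that
    match a reference, recall is the share of references that were predicted.
    """
    pred = {normalize(p) for p in predicted} - {""}
    gold = {normalize(r) for r in references} - {""}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a model that returns every valid answer to an ambiguous question receives full credit, a model that returns only one of several valid answers receives partial credit through reduced recall, and padding the output with wrong guesses is penalized through reduced precision.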
