A^2Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Abstract

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks typically assume a single gold answer, overlook this reality, and therefore produce inappropriate training signals. Existing attempts to handle ambiguity mostly rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A^2Search, an annotation-free, end-to-end training framework that recognizes and handles ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed AnsF1 reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks show that A^2Search achieves new state-of-the-art performance. With only a single rollout, A^2Search-7B reaches an average AnsF1@1 score of 48.4% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B (46.2%). Extensive analyses further show that A^2Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
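
The abstract does not spell out how the AnsF1 reward is computed. One plausible reading is an answer-level F1 between the set of answers a model returns and the set of accepted reference answers. The sketch below illustrates that reading only; the function names `normalize` and `answer_f1`, the SQuAD-style normalization, and the exact-match criterion are all assumptions and are not taken from the A^2Search code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may use a different rule.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_f1(predicted: list[str], references: list[str]) -> float:
    """Hypothetical answer-level F1 over sets of normalized answers.

    Precision is the share of predicted answers found among the references;
    recall is the share of reference answers the prediction covers.
    """
    pred = {normalize(p) for p in predicted} - {""}
    gold = {normalize(r) for r in references} - {""}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this reading, a question with two accepted answers gives full credit only when both are returned. Returning one of them still earns partial credit, and padding the output with wrong guesses lowers precision. This is one way a reward could accommodate multiple answers without encouraging over-generation.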
