Retrieval-Augmented Generation with Conflicting Evidence

Abstract

Large language model (LLM) agents increasingly employ retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, however, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on two benchmarks: on AmbigDocs, which requires presenting all valid answers for ambiguous queries, it improves over strong RAG baselines by up to 11.40%, and on FaithEval, which requires suppressing misinformation, it improves by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct obtains an exact match score of only 32.60). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains, especially as the level of imbalance between supporting evidence and misinformation increases.
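To make the debate-and-aggregate idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes one agent per retrieved document and a fixed round limit, and it replaces the LLM with a hypothetical rule-based stand-in (`toy_agent`). The names `debate`, `aggregate`, and `toy_agent` and the "topic: answer" document format are all illustrative assumptions. The stand-in only filters off-topic noise and keeps every answer to an ambiguous query; judging misinformation would require a real model and is not modeled here.

```python
from typing import Callable, Optional

# An agent maps (query, its own document, summary of the previous round)
# to an answer, or None when it judges its document irrelevant.
# In a real system this would be an LLM call; here it is a plain function.
Agent = Callable[[str, str, str], Optional[str]]


def aggregate(answers: list[Optional[str]]) -> list[str]:
    """Collate the distinct non-empty answers proposed in one round."""
    return sorted({a for a in answers if a is not None})


def debate(query: str, documents: list[str], agent: Agent,
           max_rounds: int = 3) -> list[str]:
    """Run up to max_rounds rounds, stopping early once answers stabilise."""
    final: list[str] = []
    summary = ""
    for _ in range(max_rounds):
        # Assumption: each document is handled by its own agent.
        answers = [agent(query, doc, summary) for doc in documents]
        collated = aggregate(answers)
        if collated == final:
            break  # no change since the previous round
        final = collated
        summary = "; ".join(final)  # shared with agents in the next round
    return final


def toy_agent(query: str, document: str, summary: str) -> Optional[str]:
    """Hypothetical stand-in for an LLM: documents look like 'topic: answer'."""
    topic, _, answer = document.partition(":")
    if topic.strip().lower() not in query.lower():
        return None  # off-topic document, treated as noise
    return answer.strip()


docs = [
    "mercury: a planet",
    "mercury: a chemical element",
    "weather: sunny",  # irrelevant to the query
]
print(debate("What is Mercury?", docs, toy_agent))
```

In this toy run the query is ambiguous, so both readings of "Mercury" are kept, while the unrelated document contributes nothing to the final answer list.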
