SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
June 1, 2025
Authors: Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu
cs.AI
Abstract
We introduce SealQA, a new challenge benchmark for evaluating
SEarch-Augmented Language models on fact-seeking questions where web search
yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors:
(1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and
reasoning capabilities, with Seal-0 focusing on the most challenging questions
where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3)
LongSeal, which extends SealQA to test long-context, multi-document reasoning
in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations
in current models: Even frontier LLMs perform poorly across all SealQA flavors.
On Seal-0, tool-equipped frontier agentic models such as o3 and o4-mini
achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning
efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and
o3-mini are highly vulnerable to noisy search results. Notably, increasing
test-time compute does not yield reliable gains across o3-mini, o4-mini, and
o3, with performance often plateauing or even declining early. Additionally,
while recent models are less affected by the "lost-in-the-middle" issue, they
still fail to reliably identify relevant documents in LongSeal when faced with
numerous distractors. To facilitate future work, we release SealQA at
huggingface.co/datasets/vtllms/sealqa.
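
For readers who want to experiment with the benchmark, below is a minimal sketch of loading it with the Hugging Face datasets library. The configuration name "seal_0" and the "test" split are illustrative assumptions, not confirmed by the paper; consult the dataset card at huggingface.co/datasets/vtllms/sealqa for the actual configuration and split names.

# Minimal sketch: load one SealQA flavor with the `datasets` library.
# The config name "seal_0" and split "test" are assumptions for
# illustration; see the dataset card for the actual names.
from datasets import load_dataset

seal_0 = load_dataset("vtllms/sealqa", name="seal_0", split="test")
print(seal_0[0])  # inspect the fields of one fact-seeking question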